[jira] [Updated] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath updated SPARK-12326:
-----------------------------------
    Assignee: Seth Hendrickson

> Move GBT implementation from spark.mllib to spark.ml
> ----------------------------------------------------
>
>                 Key: SPARK-12326
>                 URL: https://issues.apache.org/jira/browse/SPARK-12326
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Seth Hendrickson
>            Assignee: Seth Hendrickson
>
> Several improvements can be made to gradient boosted trees, but they are not
> possible without moving the GBT implementation to spark.ml (e.g.
> rawPrediction column, feature importance). This JIRA is for moving the
> current GBT implementation to spark.ml, which will have roughly the following
> steps:
> 1. Copy the implementation to spark.ml and change the spark.ml classes to use
> that implementation. Current tests will ensure that the implementations learn
> exactly the same models.
> 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity,
> InformationGainStats, ImpurityStats, DTStatsAggregator, etc.). Since
> eventually all tree implementations will reside in spark.ml, the helper
> classes should as well.
> 3. Remove the spark.mllib implementation, and make the spark.mllib APIs
> wrappers around the spark.ml implementation. The spark.ml tests will again
> ensure that we do not change any behavior.
> 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to
> verify model equivalence.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10610) Using AppName instead of AppId in the name of all metrics
[ https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179507#comment-15179507 ]

Pete Robbins commented on SPARK-10610:
--------------------------------------
I think the appId is an important piece of information when visualizing the metrics, along with hostname, executorId, etc. I'm writing a sink and reporter to push the metrics to Elasticsearch, and I include these in the metric types for better correlation, e.g.:
{code}
{
  "timestamp": "2016-03-03T15:58:31.903+",
  "hostName": "9.20.187.127",
  "applicationId": "app-20160303155742-0005",
  "executorId": "driver",
  "BlockManager_memory_maxMem_MB": 3933
}
{code}
I extract the appId and executorId from the metric name. When the sink is instantiated I don't believe I have access to any Utils to obtain the current appId and executorId, so I'm relying on these being in the metric name for the moment. Is it possible to make appId, applicationName, and executorId available to me via some Utils function that I have access to in a metrics Sink?

> Using AppName instead of AppId in the name of all metrics
> ---------------------------------------------------------
>
>                 Key: SPARK-10610
>                 URL: https://issues.apache.org/jira/browse/SPARK-10610
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.5.0
>            Reporter: Yi Tian
>            Priority: Minor
>
> When we use {{JMX}} to monitor the Spark system, we have to configure the
> names of the target metrics in the monitoring system. But the current metric
> name is {{appId}} + {{executorId}} + {{source}}, so whenever the Spark
> program restarts, we have to update the metric names in the monitoring
> system.
> We should add an optional configuration property to control whether to use
> the appName instead of the appId in the Spark metrics system.
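Since the sink above only sees the metric name, the appId and executorId can be recovered by splitting that name on dots, given that the name follows the {{appId}} + {{executorId}} + {{source}} pattern this issue describes. A minimal bash sketch of that parsing; the metric name below is illustrative, not taken from a real run:

```shell
#!/usr/bin/env bash
# Default Spark metric names look like: <appId>.<executorId>.<rest-of-source-name>
metric="app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB"

# Peel off the first two dot-separated components with parameter expansion.
app_id="${metric%%.*}"       # everything before the first dot
rest="${metric#*.}"          # drop "<appId>."
executor_id="${rest%%.*}"    # everything before the next dot
source_name="${rest#*.}"     # remaining source/metric path

echo "applicationId=$app_id executorId=$executor_id metric=$source_name"
```

This only works while the appId and executorId remain in the metric name, which is exactly the fragility the comment raises.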
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179491#comment-15179491 ]

Mark Grover commented on SPARK-13670:
-------------------------------------
cc [~vanzin] as FYI. I ran into this when running {{dev/mima}}. It should have failed because I didn't have the tools jar; launcher.Main() did throw an exception, but the subshell ate it all up and the MiMa testing continued, eventually giving me a ton of errors since the exclusions list wasn't correctly populated.

Anyway, there are a few ways I can think of to fix it, open to others as well:
1. We could just fix the symptom and have the dev/mima script explicitly check for the tools directory and tools jar in bash, and fail if they're not present. However, (a) that's just fixing the symptom, not the root cause, and (b) we already have those checks in the launcher code.
2. We could have the command write its output to a file instead of reading it directly from a subshell. That makes error handling easier.
3. We could have the subshell kill the parent process on error, essentially running the subshell like so:
{code}
("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" || kill $$)
{code}
The side effect is that any other subshells spawned by the parent (i.e. spark-class) would be terminated as well. Based on a quick look, I didn't see any other subshells, though.

Thoughts? Preferences?

> spark-class doesn't bubble up error from launcher command
> ---------------------------------------------------------
>
>                 Key: SPARK-13670
>                 URL: https://issues.apache.org/jira/browse/SPARK-13670
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.0.0
>            Reporter: Mark Grover
>            Priority: Minor
>
> There's a particular snippet in spark-class
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
> {code}
> The problem is that if the launcher Main fails, this code still returns
> success and continues, even though the top-level script is marked
> {{set -e}}. This is because launcher.Main is run within a subshell.
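The swallowed-exit-status behaviour described in this issue is easy to reproduce outside Spark: a command that fails inside a `< <( ... )` process substitution does not affect the parent shell's exit status. A minimal bash sketch, using `false` as a stand-in for a failing launcher invocation:

```shell
#!/usr/bin/env bash
# Mirror the spark-class pattern: read NULL-separated args from a subshell.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(false)   # stand-in for a failing "$RUNNER" ... launcher.Main

# The failure above is swallowed: execution continues with an empty array,
# and the script's own exit status is unaffected.
echo "launcher failure swallowed; got ${#CMD[@]} args"
```

This is also why `set -e` does not help here: the non-zero status belongs to the process substitution, not to the `while` compound command the parent script actually runs.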
[jira] [Commented] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179485#comment-15179485 ]

Saisai Shao commented on SPARK-13669:
-------------------------------------
Hi [~imranr], since you fixed SPARK-9439 and know the scheduler and blacklist code well, would you please comment on this issue? Thanks a lot.

> Job will always fail in the external shuffle service unavailable situation
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13669
>                 URL: https://issues.apache.org/jira/browse/SPARK-13669
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>            Reporter: Saisai Shao
>
> Currently we are running into an issue with YARN work preserving enabled +
> external shuffle service.
> In the work-preserving scenario, the failure of the NM does not lead to the
> exit of the executors, so the executors can still accept and run tasks. The
> problem is that when the NM fails, the external shuffle service is actually
> inaccessible, so reduce tasks will always complain about "Fetch failure",
> and the failure of the reduce stage makes the parent stage (map stage)
> rerun. The tricky thing is that the Spark scheduler is not aware of the
> unavailability of the external shuffle service, and will reschedule the map
> tasks on the executor where the NM failed; the reduce stage then fails again
> with "Fetch failure", and after 4 retries the job fails.
> So the actual problem is that Spark's scheduler is not aware of the
> unavailability of the external shuffle service, and will still assign tasks
> onto those nodes. The fix is to avoid assigning tasks onto those nodes.
> Currently in Spark, one related configuration is
> "spark.scheduler.executorTaskBlacklistTime", but I don't think it will work
> in this scenario: it is used to avoid rescheduling a reattempted task on the
> same executor.
> Also, mechanisms like MapReduce's blacklist may not handle this scenario,
> since all the reduce tasks fail, so counting failed tasks would equally mark
> all the executors as "bad".
[jira] [Updated] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Grover updated SPARK-13670:
--------------------------------
    Description:
There's a particular snippet in spark-class [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}
The problem is that if the launcher Main fails, this code still returns success and continues, even though the top-level script is marked {{set -e}}. This is because launcher.Main is run within a subshell.

  was:
There's a particular snippet in spark-class [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}
The problem is that if the launcher Main fails, this code still returns success and continues, even though the top-level script is marked {{set -e}}. This is because launcher.Main is run within a subshell.
> spark-class doesn't bubble up error from launcher command
> ---------------------------------------------------------
>
>                 Key: SPARK-13670
>                 URL: https://issues.apache.org/jira/browse/SPARK-13670
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 2.0.0
>            Reporter: Mark Grover
>            Priority: Minor
>
> There's a particular snippet in spark-class
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
> {code}
> The problem is that if the launcher Main fails, this code still returns
> success and continues, even though the top-level script is marked
> {{set -e}}. This is because launcher.Main is run within a subshell.
[jira] [Created] (SPARK-13670) spark-class doesn't bubble up error from launcher command
Mark Grover created SPARK-13670:
--------------------------------
             Summary: spark-class doesn't bubble up error from launcher command
                 Key: SPARK-13670
                 URL: https://issues.apache.org/jira/browse/SPARK-13670
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit
    Affects Versions: 2.0.0
            Reporter: Mark Grover
            Priority: Minor

There's a particular snippet in spark-class [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}
The problem is that if the launcher Main fails, this code still returns success and continues, even though the top-level script is marked {{set -e}}. This is because launcher.Main is run within a subshell.
[jira] [Created] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
Saisai Shao created SPARK-13669:
--------------------------------
             Summary: Job will always fail in the external shuffle service unavailable situation
                 Key: SPARK-13669
                 URL: https://issues.apache.org/jira/browse/SPARK-13669
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, YARN
            Reporter: Saisai Shao

Currently we are running into an issue with YARN work preserving enabled + external shuffle service.

In the work-preserving scenario, the failure of the NM does not lead to the exit of the executors, so the executors can still accept and run tasks. The problem is that when the NM fails, the external shuffle service is actually inaccessible, so reduce tasks will always complain about "Fetch failure", and the failure of the reduce stage makes the parent stage (map stage) rerun. The tricky thing is that the Spark scheduler is not aware of the unavailability of the external shuffle service, and will reschedule the map tasks on the executor where the NM failed; the reduce stage then fails again with "Fetch failure", and after 4 retries the job fails.

So the actual problem is that Spark's scheduler is not aware of the unavailability of the external shuffle service, and will still assign tasks onto those nodes. The fix is to avoid assigning tasks onto those nodes.

Currently in Spark, one related configuration is "spark.scheduler.executorTaskBlacklistTime", but I don't think it will work in this scenario: it is used to avoid rescheduling a reattempted task on the same executor. Also, mechanisms like MapReduce's blacklist may not handle this scenario, since all the reduce tasks fail, so counting failed tasks would equally mark all the executors as "bad".
[jira] [Resolved] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-13652.
----------------------------------
       Resolution: Fixed
         Assignee: Shixiong Zhu
    Fix Version/s: 2.0.0
                   1.6.2

> TransportClient.sendRpcSync returns wrong results
> -------------------------------------------------
>
>                 Key: SPARK-13652
>                 URL: https://issues.apache.org/jira/browse/SPARK-13652
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: huangyu
>            Assignee: Shixiong Zhu
>             Fix For: 1.6.2, 2.0.0
>
>         Attachments: RankHandler.java, Test.java
>
> TransportClient is not thread safe, and if it is called from multiple
> threads the messages can't be encoded and decoded correctly. Below is my
> code, and it prints wrong messages.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     TransportServer server = new TransportContext(new TransportConf("test",
>             new MapConfigProvider(new HashMap<String, String>())), new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(new TransportConf("test",
>             new MapConfigProvider(new HashMap<String, String>())), new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It will print lines like:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...
[jira] [Commented] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
[ https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179447#comment-15179447 ]

Nick Pentreath commented on SPARK-13639:
----------------------------------------
For SPARK-13568, we can take one of two approaches:
1. Support imputing numerical DF columns, as well as imputing columns within a vector (itself a vector DF column);
2. Only support imputing numerical DF columns.

For #1, we then need {{Statistics.colStats}} to support ignoring NaN as an option (agreed it should definitely not be the default behaviour). Potentially we could support it only at a lower level (perhaps within {{MultivariateOnlineSummarizer}}).

For scikit-learn's Imputer, obviously it works on NaN vector elements, but since we are working with DataFrames, my initial idea was actually more along the lines of #2. The {{Imputer}} would tend to be among the early steps in a pipeline, before the relevant numerical columns are transformed into a vector. So #1 is not an absolute requirement IMO, though obviously it would be more efficient to compute all the column stats for a set of columns together, and I do think it makes sense to support vector input types in {{Imputer}} if possible. Open to ideas on SPARK-13568 also.

> Statistics.colStats(rdd).mean and variance should handle NaN in the input
> vectors
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13639
>                 URL: https://issues.apache.org/jira/browse/SPARK-13639
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: yuhao yang
>            Priority: Trivial
>
> {code}
> val denseData = Array(
>   Vectors.dense(3.8, 0.0, 1.8),
>   Vectors.dense(1.7, 0.9, 0.0),
>   Vectors.dense(Double.NaN, 0, 0.0)
> )
> val rdd = sc.parallelize(denseData)
> println(Statistics.colStats(rdd).mean)
> {code}
> This prints {{[NaN,0.3,0.6]}}.
> This is just a proposal for discussion on how to handle NaN values in the
> vectors: we can either ignore the NaN value in the computation, or output
> NaN as it does now, with a warning.
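The proposed "ignore NaN" semantics can be sketched for a single column outside Spark. The awk filter below averages the first column of the example above ({{3.8, 1.7, NaN}}), skipping the non-numeric entry; this only illustrates the semantics under discussion, not the MLlib implementation:

```shell
#!/usr/bin/env bash
# First column of the example vectors: 3.8, 1.7, NaN
vals="3.8
1.7
NaN"

# Skip entries that are not valid numbers (the NaN row) and average the rest.
ignoring=$(printf '%s\n' "$vals" | awk '$1 == $1 + 0 { s += $1; n++ } END { printf "%.2f", s / n }')
echo "mean ignoring NaN: $ignoring"   # (3.8 + 1.7) / 2 = 2.75
```

The current Statistics.colStats behaviour corresponds to including the NaN row, which makes the whole mean NaN.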
[jira] [Assigned] (SPARK-13642) Properly handle signal kill of ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13642:
------------------------------------
    Assignee:     (was: Apache Spark)

> Properly handle signal kill of ApplicationMaster
> ------------------------------------------------
>
>                 Key: SPARK-13642
>                 URL: https://issues.apache.org/jira/browse/SPARK-13642
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>
> Currently when running Spark on YARN in yarn-cluster mode, the default
> application final state is "SUCCEEDED"; if any exception occurs, this final
> state is changed to "FAILED" and triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM
> receives a signal (SIGTERM) and no exception occurs. In this situation, the
> shutdown hook is invoked and marks the application as finished with success,
> and there is no further attempt.
> In such a situation, from Spark's perspective this application failed and
> needs another attempt, but from YARN's perspective the application finished
> with success.
> This can happen when the NM fails: the NM failure sends SIGTERM to the AM,
> and the AM should mark this attempt as failed and rerun, not invoke
> unregister.
> To increase the chance of hitting this race condition, here is the
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed during the sleep, no exception is thrown, so from YARN's
> point of view the application finished successfully, but from Spark's point
> of view it should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped
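As background to the SIGTERM race above: a process terminated by signal 15 exits with status 143 (128 + 15), and only installed handlers, such as the JVM shutdown hook that marks the application successful, run in between. A small standalone illustration, unrelated to the actual AM code:

```shell
#!/usr/bin/env bash
# Start a long-running process and deliver SIGTERM to it.
sleep 30 &
pid=$!
kill -TERM "$pid"

# wait reports the child's exit status: 128 + 15 = 143 for SIGTERM.
wait "$pid"
status=$?
echo "exit status after SIGTERM: $status"
```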
[jira] [Commented] (SPARK-13642) Properly handle signal kill of ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179400#comment-15179400 ]

Apache Spark commented on SPARK-13642:
--------------------------------------
User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11512

> Properly handle signal kill of ApplicationMaster
> ------------------------------------------------
>
>                 Key: SPARK-13642
>                 URL: https://issues.apache.org/jira/browse/SPARK-13642
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>
> Currently when running Spark on YARN in yarn-cluster mode, the default
> application final state is "SUCCEEDED"; if any exception occurs, this final
> state is changed to "FAILED" and triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM
> receives a signal (SIGTERM) and no exception occurs. In this situation, the
> shutdown hook is invoked and marks the application as finished with success,
> and there is no further attempt.
> In such a situation, from Spark's perspective this application failed and
> needs another attempt, but from YARN's perspective the application finished
> with success.
> This can happen when the NM fails: the NM failure sends SIGTERM to the AM,
> and the AM should mark this attempt as failed and rerun, not invoke
> unregister.
> To increase the chance of hitting this race condition, here is the
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed during the sleep, no exception is thrown, so from YARN's
> point of view the application finished successfully, but from Spark's point
> of view it should be reattempted.
[jira] [Assigned] (SPARK-13642) Properly handle signal kill of ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13642:
------------------------------------
    Assignee: Apache Spark

> Properly handle signal kill of ApplicationMaster
> ------------------------------------------------
>
>                 Key: SPARK-13642
>                 URL: https://issues.apache.org/jira/browse/SPARK-13642
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>            Assignee: Apache Spark
>
> Currently when running Spark on YARN in yarn-cluster mode, the default
> application final state is "SUCCEEDED"; if any exception occurs, this final
> state is changed to "FAILED" and triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM
> receives a signal (SIGTERM) and no exception occurs. In this situation, the
> shutdown hook is invoked and marks the application as finished with success,
> and there is no further attempt.
> In such a situation, from Spark's perspective this application failed and
> needs another attempt, but from YARN's perspective the application finished
> with success.
> This can happen when the NM fails: the NM failure sends SIGTERM to the AM,
> and the AM should mark this attempt as failed and rerun, not invoke
> unregister.
> To increase the chance of hitting this race condition, here is the
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed during the sleep, no exception is thrown, so from YARN's
> point of view the application finished successfully, but from Spark's point
> of view it should be reattempted.
[jira] [Updated] (SPARK-13642) Properly handle signal kill of ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13642: Summary: Properly handle signal kill of ApplicationMaster (was: Different finishing state between Spark and YARN) > Properly handle signal kill of ApplicationMaster > > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently, when running Spark on YARN in yarn-cluster mode, the default > application final state is "SUCCEEDED"; if any exception occurs, this > final state is changed to "FAILED", which triggers a reattempt if > possible. > This is fine in the normal case, but there is a race condition when the AM receives a > signal (SIGTERM) and no exception has occurred. In that situation, the > shutdown hook is invoked and marks the application as finished with > success, so there is no further attempt. > In this case, from Spark's perspective the application has failed > and needs another attempt, but from YARN's perspective the application finished > successfully. > This can happen when a NodeManager fails: the NM failure sends > SIGTERM to the AM, and the AM should mark the attempt as failed and rerun, not > invoke unregister. > To increase the chance of hitting this race condition, here is the reproduction code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is killed while sleeping, no exception is thrown, so from > YARN's point of view the application finished successfully, but from Spark's > point of view it should be reattempted. 
> The log normally looks like this: > {noformat} > 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : > 192.168.0.105:45454 > 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2 > 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1 > 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready > for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 > 16/03/03 12:44:23 INFO YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/metrics/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/kill,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/api,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/static,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > 
o.e.j.s.ServletContextHandler{/executors,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage,null} > 16/03/03
[jira] [Commented] (SPARK-13650) Usage of the window() function on DStream
[ https://issues.apache.org/jira/browse/SPARK-13650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179347#comment-15179347 ] Mario Briggs commented on SPARK-13650: -- Running locally on my Mac, here is the output {quote} [Stage 0:> (0 + 0) / 2]16/03/04 10:38:18 INFO VerifiableProperties: Verifying properties 16/03/04 10:38:18 INFO VerifiableProperties: Property auto.offset.reset is overridden to smallest 16/03/04 10:38:18 INFO VerifiableProperties: Property group.id is overridden to 16/03/04 10:38:18 INFO VerifiableProperties: Property zookeeper.connect is overridden to --- Time: 1457068098000 ms --- 10 [Stage 6:===> (2 + 0) / 3] [Stage 6:===> (2 + 0) / 3] {quote} The '10' is the output from the first batch interval at time '1457068098000 ms'. Thereafter, the only output for the next ~2 minutes is the 'stage 6' progress bar. A ^C fails to stop the app and I need to do a 'kill -9 pid' > Usage of the window() function on DStream > - > > Key: SPARK-13650 > URL: https://issues.apache.org/jira/browse/SPARK-13650 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2, 1.6.0, 2.0.0 >Reporter: Mario Briggs >Priority: Minor > > Is there some guidance on the usage of the window() function on DStream? Here > is my academic use case, for which it fails. > Standard word count: > {code} > val ssc = new StreamingContext(sparkConf, Seconds(6)) > val messages = KafkaUtils.createDirectStream(...) > val words = messages.map(_._2).flatMap(_.split(" ")) > val window = words.window(Seconds(12), Seconds(6)) > window.count().print() > {code} > For the first batch interval it gives the count and then it hangs (inside the > UnionRDD). > I say the above use case is academic since one can achieve similar > functionality by using instead the more compact API > {code} > words.countByWindow(Seconds(12), Seconds(6)) > {code} > which works fine. 
> Is the first approach above not the intended way of using the .window() API? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal
[ https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179339#comment-15179339 ] Zhong Wang commented on SPARK-13337: The current join method with the usingColumns argument generates a result like TableC. The limitation is that it doesn't support null-safe joins. > DataFrame join-on-columns function should support null-safe equal > - > > Key: SPARK-13337 > URL: https://issues.apache.org/jira/browse/SPARK-13337 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Zhong Wang >Priority: Minor > > Currently, the join-on-columns function: > {code} > def join(right: DataFrame, usingColumns: Seq[String], joinType: String): > DataFrame > {code} > performs a null-unsafe join. It would be great if there were an option for a > null-safe join. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
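Until the usingColumns variant supports it, a null-safe join can be expressed with an explicit condition using the {{<=>}} (null-safe equal) Column operator. A minimal sketch, assuming hypothetical DataFrames {{left}} and {{right}} sharing a column {{"key"}}:

```scala
// Sketch: null-safe equi-join via an explicit join condition. Unlike ===,
// the <=> operator evaluates null <=> null as true. The names left, right,
// and "key" are illustrative, not from the issue.
val joined = left.join(right, left("key") <=> right("key"), "inner")
```

Note the explicit-condition form keeps both key columns in the output, whereas the usingColumns form deduplicates them; that is part of why a null-safe option on usingColumns itself would be convenient.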
[jira] [Assigned] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13668: Assignee: Apache Spark > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement >Reporter: Sameer Agarwal >Assignee: Apache Spark > > If a filter predicate or a join condition includes `IsNotNull` checks, we > should reorder the checks so that the non-nullability checks are > evaluated before the rest of the predicates. > For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite it as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13668: Assignee: (was: Apache Spark) > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement >Reporter: Sameer Agarwal > > If a filter predicate or a join condition includes `IsNotNull` checks, we > should reorder the checks so that the non-nullability checks are > evaluated before the rest of the predicates. > For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite it as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179318#comment-15179318 ] Apache Spark commented on SPARK-13668: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/11511 > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement >Reporter: Sameer Agarwal > > If a filter predicate or a join condition includes `IsNotNull` checks, we > should reorder the checks so that the non-nullability checks are > evaluated before the rest of the predicates. > For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite it as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
[ https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-13668: --- Description: If a filter predicate or a join condition includes `IsNotNull` checks, we should reorder the checks so that the non-nullability checks are evaluated before the rest of the predicates. For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite it as `isNotNull(b) && a > 5` during physical plan generation. cc [~nongli] [~yhuai] was: If a filter predicate or a join condition includes `IsNotNull` checks, we should reorder the checks so that the non-nullability checks are evaluated before the rest of the predicates. For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite it as `isNotNull(b) && a` during physical plan generation. cc [~nongli] [~yhuai] > Reorder filter/join predicates to short-circuit isNotNull checks > > > Key: SPARK-13668 > URL: https://issues.apache.org/jira/browse/SPARK-13668 > Project: Spark > Issue Type: Improvement >Reporter: Sameer Agarwal > > If a filter predicate or a join condition includes `IsNotNull` checks, we > should reorder the checks so that the non-nullability checks are > evaluated before the rest of the predicates. > For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we > should rewrite it as `isNotNull(b) && a > 5` during physical plan > generation. > cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks
Sameer Agarwal created SPARK-13668: -- Summary: Reorder filter/join predicates to short-circuit isNotNull checks Key: SPARK-13668 URL: https://issues.apache.org/jira/browse/SPARK-13668 Project: Spark Issue Type: Improvement Reporter: Sameer Agarwal If a filter predicate or a join condition includes `IsNotNull` checks, we should reorder the checks so that the non-nullability checks are evaluated before the rest of the predicates. For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we should rewrite it as `isNotNull(b) && a` during physical plan generation. cc [~nongli] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
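The proposed rewrite amounts to a stable partition over the conjuncts of a predicate. A self-contained illustration (the tiny expression hierarchy below only mirrors Catalyst class names such as {{IsNotNull}}; this is not the actual patch):

```scala
// Illustrative sketch of the reordering, not the actual Spark patch.
// A stand-in expression hierarchy borrowing Catalyst's names:
sealed trait Expression
case class IsNotNull(attr: String) extends Expression
case class GreaterThan(attr: String, v: Int) extends Expression

// Stable partition: IsNotNull conjuncts first, everything else after,
// preserving the relative order within each group.
def reorderConjuncts(conjuncts: Seq[Expression]): Seq[Expression] = {
  val (notNullChecks, rest) = conjuncts.partition(_.isInstanceOf[IsNotNull])
  notNullChecks ++ rest
}

// `a > 5 && isNotNull(b)` becomes `isNotNull(b) && a > 5`:
// reorderConjuncts(Seq(GreaterThan("a", 5), IsNotNull("b")))
//   == Seq(IsNotNull("b"), GreaterThan("a", 5))
```

The stable partition matters: predicates other than the null checks keep their original evaluation order, so only the short-circuiting behavior changes.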
[jira] [Created] (SPARK-13667) Support for specifying custom date format for date and timestamp types
Hyukjin Kwon created SPARK-13667: Summary: Support for specifying custom date format for date and timestamp types Key: SPARK-13667 URL: https://issues.apache.org/jira/browse/SPARK-13667 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Hyukjin Kwon Priority: Minor Currently, the CSV data source does not support parsing date and timestamp types in a custom format, nor inferring the timestamp type from a custom format. It looks like quite a few users want this feature. It would be great to be able to set a custom date format. This was reported in spark-csv: https://github.com/databricks/spark-csv/issues/279 https://github.com/databricks/spark-csv/issues/262 https://github.com/databricks/spark-csv/issues/266 I have submitted a PR for this in spark-csv: https://github.com/databricks/spark-csv/pull/280 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
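For reference, usage along these lines is what the linked spark-csv PR proposes. This is only a sketch: the option name {{dateFormat}}, the pattern, and the file path are assumptions pending the final API.

```scala
// Sketch of the proposed reader option (name assumed from the spark-csv PR
// under discussion): parse date/timestamp columns with a custom pattern.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy/MM/dd HH:mm")  // hypothetical custom pattern
  .load("events.csv")                        // illustrative path
```

Without such an option, columns in non-default formats are typically inferred as plain strings and must be cast by hand afterwards.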
[jira] [Resolved] (SPARK-13647) also check if numeric value is within allowed range in _verify_type
[ https://issues.apache.org/jira/browse/SPARK-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13647. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11492 [https://github.com/apache/spark/pull/11492] > also check if numeric value is within allowed range in _verify_type > --- > > Key: SPARK-13647 > URL: https://issues.apache.org/jira/browse/SPARK-13647 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13626) SparkConf deprecation log messages are printed multiple times
[ https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13626: Assignee: Apache Spark > SparkConf deprecation log messages are printed multiple times > - > > Key: SPARK-13626 > URL: https://issues.apache.org/jira/browse/SPARK-13626 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > I noticed that if I have a deprecated config in my spark-defaults.conf, I'll > see multiple warnings when running, for example, spark-shell. I collected the > backtrace from when the messages are printed, and here's a few instances. The > first one is the only one I expect to be printed. > {noformat} > java.lang.Exception: > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.(Main.scala:30) > {noformat} > The following ones are causing duplicate log messages and we should clean > those up: > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.setAll(SparkConf.scala:139) > at org.apache.spark.SparkConf.clone(SparkConf.scala:358) > at org.apache.spark.SparkContext.(SparkContext.scala:368) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98) > {noformat} > There are also a few more caused by the use of {{SparkConf.clone()}}. > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... 
> at org.apache.spark.SparkConf.(SparkConf.scala:59) > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
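One way to suppress the duplicates described above is to remember, process-wide, which deprecated keys have already been warned about, so that `clone()`d or re-created SparkConf instances do not repeat the message. A sketch only, not the approach taken in the linked PR:

```scala
// Sketch: deduplicate deprecation warnings across SparkConf instances by
// tracking warned keys in a shared, thread-safe set. Illustrative only.
import java.util.concurrent.ConcurrentHashMap

object DeprecationWarnings {
  private val warned = ConcurrentHashMap.newKeySet[String]()

  // Invokes `log` only the first time `key` is seen in this JVM.
  def warnOnce(key: String)(log: String => Unit): Unit =
    if (warned.add(key)) log(s"The configuration key '$key' is deprecated")
}
```

Callers would route every deprecation message through `warnOnce`, making reconstruction paths like `SparkConf.clone()` naturally silent on repeats.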
[jira] [Assigned] (SPARK-13626) SparkConf deprecation log messages are printed multiple times
[ https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13626: Assignee: (was: Apache Spark) > SparkConf deprecation log messages are printed multiple times > - > > Key: SPARK-13626 > URL: https://issues.apache.org/jira/browse/SPARK-13626 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I noticed that if I have a deprecated config in my spark-defaults.conf, I'll > see multiple warnings when running, for example, spark-shell. I collected the > backtrace from when the messages are printed, and here's a few instances. The > first one is the only one I expect to be printed. > {noformat} > java.lang.Exception: > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.(Main.scala:30) > {noformat} > The following ones are causing duplicate log messages and we should clean > those up: > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.setAll(SparkConf.scala:139) > at org.apache.spark.SparkConf.clone(SparkConf.scala:358) > at org.apache.spark.SparkContext.(SparkContext.scala:368) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98) > {noformat} > There are also a few more caused by the use of {{SparkConf.clone()}}. > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... 
> at org.apache.spark.SparkConf.(SparkConf.scala:59) > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13626) SparkConf deprecation log messages are printed multiple times
[ https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179276#comment-15179276 ] Apache Spark commented on SPARK-13626: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11510 > SparkConf deprecation log messages are printed multiple times > - > > Key: SPARK-13626 > URL: https://issues.apache.org/jira/browse/SPARK-13626 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > I noticed that if I have a deprecated config in my spark-defaults.conf, I'll > see multiple warnings when running, for example, spark-shell. I collected the > backtrace from when the messages are printed, and here's a few instances. The > first one is the only one I expect to be printed. > {noformat} > java.lang.Exception: > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.(Main.scala:30) > {noformat} > The following ones are causing duplicate log messages and we should clean > those up: > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.setAll(SparkConf.scala:139) > at org.apache.spark.SparkConf.clone(SparkConf.scala:358) > at org.apache.spark.SparkContext.(SparkContext.scala:368) > at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98) > {noformat} > There are also a few more caused by the use of {{SparkConf.clone()}}. > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... 
> at org.apache.spark.SparkConf.(SparkConf.scala:59) > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238) > {noformat} > {noformat} > java.lang.Exception: > at > org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682) > ... > at org.apache.spark.SparkConf.(SparkConf.scala:53) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179254#comment-15179254 ] Jeff Zhang commented on SPARK-13587: SPARK-13081 is almost done. But for this ticket I definitely need your feedback and help; I am not a heavy Python user, so I may miss things that Python programmers care about. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > PySpark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA is about bringing these two > tools to the distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179253#comment-15179253 ] Juliet Hougland commented on SPARK-13587: - That is wonderful. Let me know if you'd like me to help work on it. It has been dangling on my mental todo list for a while. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > PySpark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA is about bringing these two > tools to the distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179251#comment-15179251 ] Jeff Zhang commented on SPARK-13587: Actually, I created this ticket several weeks ago: SPARK-13081 :) > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > PySpark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA is about bringing these two > tools to the distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179250#comment-15179250 ] Juliet Hougland commented on SPARK-13587: - I made a comment related to this below. TL;DR: I think the suggested --py-env option could be encompassed by an already-needed --pyspark_python sort of option. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > PySpark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA is about bringing these two > tools to the distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179247#comment-15179247 ] Juliet Hougland commented on SPARK-13587: - Currently the way users specify the workers' Python interpreter is through the PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path to be specified by a CLI flag. That is a current rough edge of using already-installed envs on a cluster. If this were added as a CLI flag, I could see valid options being 'pyspark/python/path', 'venv' (temp virtualenv), and 'conda' (temp conda env), requiring a second flag to specify the requirements file. I think it helps prevent an explosion of flags for spark-submit while helping handle a very important and often-changed parameter for a job. What do you think of this? > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party Python packages in > PySpark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has two different virtualenv implementations: one is the native > virtualenv, the other is through conda. This JIRA is about bringing these two > tools to the distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179241#comment-15179241 ] Gayathri Murali commented on SPARK-13641: - [~xusen] I can remove c_feature from Vector Assembler code, but is the naming convention intentional? > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets features' names from the features column. Usually, the > features column is generated by RFormula. The latter has a VectorAssembler in > it, which leads the output attributes not equal with the original ones. > E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". > But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > adds salts of the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
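The name mismatch described above can be illustrated with a plain-Python sketch (not Spark code). One-hot encoding a categorical column drops the reference category and suffixes each attribute with the category value, which is why "V2" surfaces as "V2_y"; the suffix convention here is inferred from the example in the description, and stripping it naively would break for column names that themselves contain underscores.

```python
def encoded_attr_names(col, categories):
    # RFormula's VectorAssembler path one-hot encodes a categorical
    # column, dropping the first (reference) category; each remaining
    # category produces a suffixed attribute name.
    return ["%s_%s" % (col, c) for c in categories[1:]]

def original_name(attr_name):
    # Recover "V2" from "V2_y" by stripping the trailing "_<category>".
    # Caveat: ambiguous if the original column name contains "_".
    return attr_name.rsplit("_", 1)[0]
```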
[jira] [Commented] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors
[ https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179227#comment-15179227 ] yuhao yang commented on SPARK-13639: Perhaps we can find some other way for SPARK-13568. I just need to make sure the performance is acceptable. > Statistics.colStats(rdd).mean and variance should handle NaN in the input > vectors > - > > Key: SPARK-13639 > URL: https://issues.apache.org/jira/browse/SPARK-13639 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: yuhao yang >Priority: Trivial > > {code} > val denseData = Array( > Vectors.dense(3.8, 0.0, 1.8), > Vectors.dense(1.7, 0.9, 0.0), > Vectors.dense(Double.NaN, 0, 0.0) > ) > val rdd = sc.parallelize(denseData) > println(Statistics.colStats(rdd).mean) > // prints: [NaN,0.3,0.6] > {code} > This is just a proposal for discussion on how to handle NaN values in the > vectors. We can either ignore the NaN values in the computation, or output NaN as > is done now with a warning.
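The two proposed behaviors can be contrasted with a small pure-Python sketch (no Spark required): propagate NaN as colStats does today, or skip NaN entries per column.

```python
import math

def col_means(vectors, skip_nan=False):
    """Column-wise means over a list of equal-length vectors.

    With skip_nan=False, any NaN entry makes that column's mean NaN
    (today's colStats behavior); with skip_nan=True, NaN entries are
    dropped before averaging (an all-NaN column still yields NaN).
    """
    n_cols = len(vectors[0])
    means = []
    for j in range(n_cols):
        col = [v[j] for v in vectors]
        if skip_nan:
            col = [x for x in col if not math.isnan(x)]
        means.append(sum(col) / len(col) if col else float("nan"))
    return means

# The dense data from the issue description.
dense_data = [
    [3.8, 0.0, 1.8],
    [1.7, 0.9, 0.0],
    [float("nan"), 0.0, 0.0],
]
```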
[jira] [Assigned] (SPARK-13665) Initial separation of concerns
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13665: Assignee: Apache Spark (was: Michael Armbrust) > Initial separation of concerns > -- > > Key: SPARK-13665 > URL: https://issues.apache.org/jira/browse/SPARK-13665 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Blocker > > The goal here is to break apart: File Management, code to deal with specific > formats and query planning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13665) Initial separation of concerns
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179224#comment-15179224 ] Apache Spark commented on SPARK-13665: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/11509 > Initial separation of concerns > -- > > Key: SPARK-13665 > URL: https://issues.apache.org/jira/browse/SPARK-13665 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > The goal here is to break apart: File Management, code to deal with specific > formats and query planning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13665) Initial separation of concerns
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13665: Assignee: Michael Armbrust (was: Apache Spark) > Initial separation of concerns > -- > > Key: SPARK-13665 > URL: https://issues.apache.org/jira/browse/SPARK-13665 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > The goal here is to break apart: File Management, code to deal with specific > formats and query planning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157 ] huangyu edited comment on SPARK-13652 at 3/4/16 3:09 AM: - I think the doc is about fetching stream rather than sendRpc was (Author: huang_yuu): I think the docs are about fetching stream rather than sendRpc > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code,and it > will print wrong message. > {code} > public static void main(String[] args) throws IOException, > InterruptedException { > TransportServer server = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap())), new > RankHandler()). > createServer(8081, new > LinkedList()); > TransportContext context = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap ())), new > NoOpRpcHandler(), true); > final TransportClientFactory clientFactory = > context.createClientFactory(); > List ts = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > ts.add(new Thread(new Runnable() { > @Override > public void run() { > for (int j = 0; j < 1000; j++) { > try { > ByteBuf buf = Unpooled.buffer(8); > buf.writeLong((long) j); > ByteBuffer byteBuffer = > clientFactory.createClient("localhost", 8081). 
> sendRpcSync(buf.nioBuffer(), > Long.MAX_VALUE); > long response = byteBuffer.getLong(); > if (response != j) { > System.err.println("send:" + j + ",response:" > + response); > } > } catch (IOException e) { > e.printStackTrace(); > } > } > } > })); > ts.get(i).start(); > } > for (Thread t : ts) { > t.join(); > } > server.close(); > } > public class RankHandler extends RpcHandler { > private final Logger logger = LoggerFactory.getLogger(RankHandler.class); > private final StreamManager streamManager; > public RankHandler() { > this.streamManager = new OneForOneStreamManager(); > } > @Override > public void receive(TransportClient client, ByteBuffer msg, > RpcResponseCallback callback) { > callback.onSuccess(msg); > } > @Override > public StreamManager getStreamManager() { > return streamManager; > } > } > {code} > it will print as below > send:221,response:222 > send:233,response:234 > send:312,response:313 > send:358,response:359 > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
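The failure mode above (a response paired with the wrong request) is the classic race when a multi-step frame write on a shared channel is not serialized. The following Python sketch illustrates the fix, not Spark's actual transport code: each send performs two writes (header, then payload), so concurrent senders must hold a lock for the whole frame to keep headers and payloads paired.

```python
import threading

class SyncChannel:
    """Toy channel: a send is two separate writes (header, payload),
    so concurrent senders must hold the lock for the whole frame."""
    def __init__(self):
        self._lock = threading.Lock()
        self._frames = []

    def send(self, request_id, payload):
        with self._lock:  # guard the multi-step write as one unit
            self._frames.append(("header", request_id))
            self._frames.append(("payload", payload))

    def frames(self):
        return list(self._frames)

# 50 concurrent senders; with the lock, every header is immediately
# followed by its own payload. Remove the lock and the pairing check
# below can fail nondeterministically, like the mismatches reported.
channel = SyncChannel()
threads = [threading.Thread(target=channel.send, args=(i, i * 10))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```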
[jira] [Created] (SPARK-13666) Annoying warnings from SQLConf in log output
Marcelo Vanzin created SPARK-13666: -- Summary: Annoying warnings from SQLConf in log output Key: SPARK-13666 URL: https://issues.apache.org/jira/browse/SPARK-13666 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Minor Whenever I run spark-shell I get a bunch of warnings about SQL configuration: {noformat} 16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark SQL config in SQLConf: key = spark.yarn.driver.memoryOverhead, value = 26 16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark SQL config in SQLConf: key = spark.yarn.executor.memoryOverhead, value = 26 16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark SQL config in SQLConf: key = spark.executor.cores, value = 1 16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark SQL config in SQLConf: key = spark.executor.memory, value = 268435456 {noformat} That shouldn't happen, since I'm not setting those values explicitly. They're either set internally by Spark or come from spark-defaults.conf.
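The warning presumably comes from a prefix check along these lines; this is a minimal sketch of the behavior, and the real SQLConf logic may differ:

```python
def set_conf(sql_conf, key, value, warnings):
    # Only keys under "spark.sql." belong in SQLConf; anything else
    # (e.g. spark.yarn.*, spark.executor.*) triggers the warning seen
    # in the log output above, even when set internally or via
    # spark-defaults.conf rather than by the user.
    if not key.startswith("spark.sql."):
        warnings.append(
            "Attempt to set non-Spark SQL config in SQLConf: key = %s, "
            "value = %s" % (key, value))
    sql_conf[key] = value
```

The annoyance is that the check cannot distinguish user-supplied keys from ones propagated by Spark itself, which is why defaults from spark-defaults.conf also trip it.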
[jira] [Assigned] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13531: Assignee: (was: Apache Spark) > Some DataFrame joins stopped working with UnsupportedOperationException: No > size estimation available for objects > - > > Key: SPARK-13531 > URL: https://issues.apache.org/jira/browse/SPARK-13531 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: koert kuipers >Priority: Minor > > this is using spark 2.0.0-SNAPSHOT > dataframe df1: > schema: > {noformat}StructType(StructField(x,IntegerType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , obj#135: object, [if (input[0, object].isNullAt) > null else input[0, object].get AS x#128] > +- MapPartitions , createexternalrow(if (isnull(x#9)) null else > x#9), [input[0, object] AS obj#135] >+- WholeStageCodegen > : +- Project [_1#8 AS x#9] > : +- Scan ExistingRDD[_1#8]{noformat} > show: > {noformat}+---+ > | x| > +---+ > | 2| > | 3| > +---+{noformat} > dataframe df2: > schema: > {noformat}StructType(StructField(x,IntegerType,true), > StructField(y,StringType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else > y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get > AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, > object].get, true) AS y#131] > +- WholeStageCodegen >: +- Project [_1#0 AS x#2,_2#1 AS y#3] >: +- Scan ExistingRDD[_1#0,_2#1]{noformat} > show: > {noformat}+---+---+ > | x| y| > +---+---+ > | 1| 1| > | 2| 2| > | 3| 3| > +---+---+{noformat} > i run: > df1.join(df2, Seq("x")).show > i get: > {noformat}java.lang.UnsupportedOperationException: No size estimation > available for objects. 
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323) > at > org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat} > now sure what changed, this ran about a week ago without issues (in our > internal unit tests). it is fully reproducible, however when i tried to > minimize the issue i could not reproduce it by just creating data frames in > the repl with the same contents, so it probably has something to do with way > these are created (from Row objects and StructTypes). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179201#comment-15179201 ] Apache Spark commented on SPARK-13531: -- User 'zuowang' has created a pull request for this issue: https://github.com/apache/spark/pull/11508 > Some DataFrame joins stopped working with UnsupportedOperationException: No > size estimation available for objects > - > > Key: SPARK-13531 > URL: https://issues.apache.org/jira/browse/SPARK-13531 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: koert kuipers >Priority: Minor > > this is using spark 2.0.0-SNAPSHOT > dataframe df1: > schema: > {noformat}StructType(StructField(x,IntegerType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , obj#135: object, [if (input[0, object].isNullAt) > null else input[0, object].get AS x#128] > +- MapPartitions , createexternalrow(if (isnull(x#9)) null else > x#9), [input[0, object] AS obj#135] >+- WholeStageCodegen > : +- Project [_1#8 AS x#9] > : +- Scan ExistingRDD[_1#8]{noformat} > show: > {noformat}+---+ > | x| > +---+ > | 2| > | 3| > +---+{noformat} > dataframe df2: > schema: > {noformat}StructType(StructField(x,IntegerType,true), > StructField(y,StringType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else > y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get > AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, > object].get, true) AS y#131] > +- WholeStageCodegen >: +- Project [_1#0 AS x#2,_2#1 AS y#3] >: +- Scan ExistingRDD[_1#0,_2#1]{noformat} > show: > {noformat}+---+---+ > | x| y| > +---+---+ > | 1| 1| > | 2| 2| > | 3| 3| > +---+---+{noformat} > i run: > df1.join(df2, Seq("x")).show > i get: > {noformat}java.lang.UnsupportedOperationException: No size estimation > available for objects. 
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323) > at > org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat} > now sure what changed, this ran about a week ago without issues (in our > internal unit tests). it is fully reproducible, however when i tried to > minimize the issue i could not reproduce it by just creating data frames in > the repl with the same contents, so it probably has something to do with way > these are created (from Row objects and StructTypes). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects
[ https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13531: Assignee: Apache Spark > Some DataFrame joins stopped working with UnsupportedOperationException: No > size estimation available for objects > - > > Key: SPARK-13531 > URL: https://issues.apache.org/jira/browse/SPARK-13531 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: koert kuipers >Assignee: Apache Spark >Priority: Minor > > this is using spark 2.0.0-SNAPSHOT > dataframe df1: > schema: > {noformat}StructType(StructField(x,IntegerType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , obj#135: object, [if (input[0, object].isNullAt) > null else input[0, object].get AS x#128] > +- MapPartitions , createexternalrow(if (isnull(x#9)) null else > x#9), [input[0, object] AS obj#135] >+- WholeStageCodegen > : +- Project [_1#8 AS x#9] > : +- Scan ExistingRDD[_1#8]{noformat} > show: > {noformat}+---+ > | x| > +---+ > | 2| > | 3| > +---+{noformat} > dataframe df2: > schema: > {noformat}StructType(StructField(x,IntegerType,true), > StructField(y,StringType,true)){noformat} > explain: > {noformat}== Physical Plan == > MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else > y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get > AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, > object].get, true) AS y#131] > +- WholeStageCodegen >: +- Project [_1#0 AS x#2,_2#1 AS y#3] >: +- Scan ExistingRDD[_1#0,_2#1]{noformat} > show: > {noformat}+---+---+ > | x| y| > +---+---+ > | 1| 1| > | 2| 2| > | 3| 3| > +---+---+{noformat} > i run: > df1.join(df2, Seq("x")).show > i get: > {noformat}java.lang.UnsupportedOperationException: No size estimation > available for objects. 
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323) > at > org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat} > now sure what changed, this ran about a week ago without issues (in our > internal unit tests). it is fully reproducible, however when i tried to > minimize the issue i could not reproduce it by just creating data frames in > the repl with the same contents, so it probably has something to do with way > these are created (from Row objects and StructTypes). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157 ] huangyu edited comment on SPARK-13652 at 3/4/16 2:53 AM: - I think the docs are about fetching stream rather than sendRpc was (Author: huang_yuu): I think it is about fetching stream rather than sendRpc > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code,and it > will print wrong message. > {code} > public static void main(String[] args) throws IOException, > InterruptedException { > TransportServer server = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap())), new > RankHandler()). > createServer(8081, new > LinkedList()); > TransportContext context = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap ())), new > NoOpRpcHandler(), true); > final TransportClientFactory clientFactory = > context.createClientFactory(); > List ts = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > ts.add(new Thread(new Runnable() { > @Override > public void run() { > for (int j = 0; j < 1000; j++) { > try { > ByteBuf buf = Unpooled.buffer(8); > buf.writeLong((long) j); > ByteBuffer byteBuffer = > clientFactory.createClient("localhost", 8081). 
> sendRpcSync(buf.nioBuffer(), > Long.MAX_VALUE); > long response = byteBuffer.getLong(); > if (response != j) { > System.err.println("send:" + j + ",response:" > + response); > } > } catch (IOException e) { > e.printStackTrace(); > } > } > } > })); > ts.get(i).start(); > } > for (Thread t : ts) { > t.join(); > } > server.close(); > } > public class RankHandler extends RpcHandler { > private final Logger logger = LoggerFactory.getLogger(RankHandler.class); > private final StreamManager streamManager; > public RankHandler() { > this.streamManager = new OneForOneStreamManager(); > } > @Override > public void receive(TransportClient client, ByteBuffer msg, > RpcResponseCallback callback) { > callback.onSuccess(msg); > } > @Override > public StreamManager getStreamManager() { > return streamManager; > } > } > {code} > it will print as below > send:221,response:222 > send:233,response:234 > send:312,response:313 > send:358,response:359 > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13665) Initial separation of concerns
Michael Armbrust created SPARK-13665: Summary: Initial separation of concerns Key: SPARK-13665 URL: https://issues.apache.org/jira/browse/SPARK-13665 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker The goal here is to break apart: File Management, code to deal with specific formats and query planning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13664) Simplify and Speedup HadoopFSRelation
Michael Armbrust created SPARK-13664: Summary: Simplify and Speedup HadoopFSRelation Key: SPARK-13664 URL: https://issues.apache.org/jira/browse/SPARK-13664 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker A majority of Spark SQL queries likely run though {{HadoopFSRelation}}, however there are currently several complexity and performance problems with this code path: - The class mixes the concerns of file management, schema reconciliation, scan building, bucketing, partitioning, and writing data. - For very large tables, we are broadcasting the entire list of files to every executor. [SPARK-11441] - For partitioned tables, we always do an extra projection. This results not only in a copy, but undoes much of the performance gains that we are going to get from vectorized reads. This is an umbrella ticket to track a set of improvements to this codepath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1
Ted Yu created SPARK-13663: -- Summary: Upgrade Snappy Java to 1.1.2.1 Key: SPARK-13663 URL: https://issues.apache.org/jira/browse/SPARK-13663 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ted Yu Priority: Minor The JVM memory leaky problem reported in https://github.com/xerial/snappy-java/issues/131 has been resolved. 1.1.2.1 was released on Jan 22nd. We should upgrade to this release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179186#comment-15179186 ] huangyu edited comment on SPARK-13652 at 3/4/16 2:39 AM: - I used your patch and it worked, great!! was (Author: huang_yuu): I use your patch and it worked, great!! > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code,and it > will print wrong message. > {code} > public static void main(String[] args) throws IOException, > InterruptedException { > TransportServer server = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap())), new > RankHandler()). > createServer(8081, new > LinkedList()); > TransportContext context = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap ())), new > NoOpRpcHandler(), true); > final TransportClientFactory clientFactory = > context.createClientFactory(); > List ts = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > ts.add(new Thread(new Runnable() { > @Override > public void run() { > for (int j = 0; j < 1000; j++) { > try { > ByteBuf buf = Unpooled.buffer(8); > buf.writeLong((long) j); > ByteBuffer byteBuffer = > clientFactory.createClient("localhost", 8081). 
> sendRpcSync(buf.nioBuffer(), > Long.MAX_VALUE); > long response = byteBuffer.getLong(); > if (response != j) { > System.err.println("send:" + j + ",response:" > + response); > } > } catch (IOException e) { > e.printStackTrace(); > } > } > } > })); > ts.get(i).start(); > } > for (Thread t : ts) { > t.join(); > } > server.close(); > } > public class RankHandler extends RpcHandler { > private final Logger logger = LoggerFactory.getLogger(RankHandler.class); > private final StreamManager streamManager; > public RankHandler() { > this.streamManager = new OneForOneStreamManager(); > } > @Override > public void receive(TransportClient client, ByteBuffer msg, > RpcResponseCallback callback) { > callback.onSuccess(msg); > } > @Override > public StreamManager getStreamManager() { > return streamManager; > } > } > {code} > it will print as below > send:221,response:222 > send:233,response:234 > send:312,response:313 > send:358,response:359 > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179186#comment-15179186 ] huangyu commented on SPARK-13652: - I use your patch and it worked, great!! > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code,and it > will print wrong message. > {code} > public static void main(String[] args) throws IOException, > InterruptedException { > TransportServer server = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap())), new > RankHandler()). > createServer(8081, new > LinkedList()); > TransportContext context = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap ())), new > NoOpRpcHandler(), true); > final TransportClientFactory clientFactory = > context.createClientFactory(); > List ts = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > ts.add(new Thread(new Runnable() { > @Override > public void run() { > for (int j = 0; j < 1000; j++) { > try { > ByteBuf buf = Unpooled.buffer(8); > buf.writeLong((long) j); > ByteBuffer byteBuffer = > clientFactory.createClient("localhost", 8081). 
> sendRpcSync(buf.nioBuffer(), > Long.MAX_VALUE); > long response = byteBuffer.getLong(); > if (response != j) { > System.err.println("send:" + j + ",response:" > + response); > } > } catch (IOException e) { > e.printStackTrace(); > } > } > } > })); > ts.get(i).start(); > } > for (Thread t : ts) { > t.join(); > } > server.close(); > } > public class RankHandler extends RpcHandler { > private final Logger logger = LoggerFactory.getLogger(RankHandler.class); > private final StreamManager streamManager; > public RankHandler() { > this.streamManager = new OneForOneStreamManager(); > } > @Override > public void receive(TransportClient client, ByteBuffer msg, > RpcResponseCallback callback) { > callback.onSuccess(msg); > } > @Override > public StreamManager getStreamManager() { > return streamManager; > } > } > {code} > it will print as below > send:221,response:222 > send:233,response:234 > send:312,response:313 > send:358,response:359 > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179180#comment-15179180 ] Gayathri Murali commented on SPARK-13629: - [~josephkb] In the case of discrete probabilistic models, where this is most applicable, would the min_df and max_df thresholds change, or should only the word counts be set to 0 or 1 depending on the value of binary? > Add binary toggle Param to CountVectorizer > -- > > Key: SPARK-13629 > URL: https://issues.apache.org/jira/browse/SPARK-13629 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It would be handy to add a binary toggle Param to CountVectorizer, as in the > scikit-learn one: > [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] > If set, then all non-zero counts will be set to 1.
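Semantics analogous to scikit-learn's binary=True can be sketched in plain Python; this shows only the toggle itself, leaving aside the min_df/max_df question raised in the comment:

```python
from collections import Counter

def count_vectorize(docs, binary=False):
    """Term counts per tokenized document, over a sorted vocabulary.

    With binary=True, every non-zero count is clamped to 1, matching
    the described behavior of scikit-learn's CountVectorizer(binary=True).
    """
    vocab = sorted({w for d in docs for w in d})
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([min(counts[w], 1) if binary else counts[w]
                     for w in vocab])
    return vocab, rows
```

For discrete models such as Bernoulli naive Bayes, the binarized rows are the natural input; the raw counts suit multinomial models.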
[jira] [Commented] (SPARK-12221) Add CPU time metric to TaskMetrics
[ https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179173#comment-15179173 ] Jisoo Kim commented on SPARK-12221: --- I made this backward-compatible, so it should be able to read history written with earlier versions. > Add CPU time metric to TaskMetrics > -- > > Key: SPARK-12221 > URL: https://issues.apache.org/jira/browse/SPARK-12221 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 1.5.2 >Reporter: Jisoo Kim > > Currently TaskMetrics doesn't support executor CPU time. I'd like to have one > so I can retrieve the metric from History Server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157 ] huangyu commented on SPARK-13652: - I think it is about fetching a stream rather than sendRpc > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code, and it > will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     TransportServer server = new TransportContext(new TransportConf("test",
>             new MapConfigProvider(new HashMap<String, String>())), new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(new TransportConf("test",
>             new MapConfigProvider(new HashMap<String, String>())), new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> it will print as below:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
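The failure mode reported above can be modeled without the Spark classes at all: when several callers share one unsynchronized request/response channel, each caller simply reads the next response on the wire, which may answer someone else's request. A toy sketch with hypothetical names (not the TransportClient API), using a deterministic interleaving in place of an actual thread race:

```python
import queue

class SharedChannel:
    """An echo channel whose responses sit on one shared FIFO queue."""

    def __init__(self):
        self._responses = queue.Queue()

    def send(self, payload):
        # The "server" echoes the request back, like RankHandler does.
        self._responses.put(payload)

    def recv(self):
        # Whoever calls recv() first gets the oldest queued response,
        # regardless of which caller's request produced it.
        return self._responses.get()

ch = SharedChannel()
ch.send("thread1:42")        # thread 1 sends its request
ch.send("thread2:7")         # thread 2 sends before thread 1 has read its reply
got_by_thread2 = ch.recv()   # thread 2 reads first and steals thread 1's reply
```

This is why the prints above show `response` off by a slot from `send`: the replies themselves are correct, they are just consumed by the wrong waiter.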
[jira] [Comment Edited] (SPARK-13642) Different finishing state between Spark and YARN
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179102#comment-15179102 ] Saisai Shao edited comment on SPARK-13642 at 3/4/16 1:45 AM: - Hi [~tgraves], thanks a lot for your comments. I think here {{stop()}} is not called explicitly by the user's code when app is finished, this is triggered in shutdown hook when receiving a signal. If the AM is terminated by signal, except deliberately killing, the application is actually in the middle of runtime, so does it make sense to mark it as successful? In this specific case, we expect another attempt but yarn fails to do it. Personally I think we should handle this situation properly. was (Author: jerryshao): Hi [~tgraves], thanks a lot for your comments. I think here {{stop()}} is not called explicitly by the user's code when app is finished, this is triggered in shutdown hook when receiving a signal. Here in ApplicationMaster, if user's class is finished, either by calling {{System.exit()}} or to the end of main function, we will mark this application finished with success, you could see [here|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L553]. If the AM is terminated by signal, except deliberately killing, the application is actually in the middle of runtime, so does it make sense to mark it as successful? In this specific case, we expect another attempt but yarn fails to do it. > Different finishing state between Spark and YARN > > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. 
> This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. > This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should mark this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. > The log normally like this: > {noformat} > 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : > 192.168.0.105:45454 > 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2 > 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1 > 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready > for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 > 16/03/03 12:44:23 INFO YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > 16/03/03 12:44:39 ERROR 
ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/metrics/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/kill,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/api,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/static,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped >
[jira] [Resolved] (SPARK-13415) Visualize subquery in SQL web UI
[ https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13415. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11417 [https://github.com/apache/spark/pull/11417] > Visualize subquery in SQL web UI > > > Key: SPARK-13415 > URL: https://issues.apache.org/jira/browse/SPARK-13415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Right now, uncorrelated scalar subqueries are not showed in SQL tab. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13642) Different finishing state between Spark and YARN
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13642: Summary: Different finishing state between Spark and YARN (was: Inconsistent finishing state between driver and AM ) > Different finishing state between Spark and YARN > > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. > This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. > This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should mark this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. 
> The log normally like this: > {noformat} > 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : > 192.168.0.105:45454 > 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2 > 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1 > 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready > for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 > 16/03/03 12:44:23 INFO YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/metrics/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/kill,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/api,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/static,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > 
o.e.j.s.ServletContextHandler{/executors,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/pool,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage,null} >
[jira] [Commented] (SPARK-13642) Inconsistent finishing state between driver and AM
[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179102#comment-15179102 ] Saisai Shao commented on SPARK-13642: - Hi [~tgraves], thanks a lot for your comments. I think here {{stop()}} is not called explicitly by the user's code when app is finished, this is triggered in shutdown hook when receiving a signal. Here in ApplicationMaster, if user's class is finished, either by calling {{System.exit()}} or to the end of main function, we will mark this application finished with success, you could see [here|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L553]. If the AM is terminated by signal, except deliberately killing, the application is actually in the middle of runtime, so does it make sense to mark it as successful? In this specific case, we expect another attempt but yarn fails to do it. > Inconsistent finishing state between driver and AM > --- > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. > This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. 
> This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should mark this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(3L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. > The log normally like this: > {noformat} > 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : > 192.168.0.105:45454 > 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2 > 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor > NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1 > 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager > 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468) > 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready > for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 > 16/03/03 12:44:23 INFO YarnClusterScheduler: > YarnClusterScheduler.postStartHook done > 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM > 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/metrics/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/stages/stage/kill,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/api,null} > 16/03/03 12:44:39 INFO 
ContextHandler: stopped > o.e.j.s.ServletContextHandler{/,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/static,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/threadDump,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/executors,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment/json,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/environment,null} > 16/03/03 12:44:39 INFO ContextHandler: stopped > o.e.j.s.ServletContextHandler{/storage/rdd/json,null} > 16/03/03 12:44:39 INFO
[jira] [Commented] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()
[ https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179078#comment-15179078 ] Apache Spark commented on SPARK-13601: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11507 > Invoke task failure callbacks before calling outputstream.close() > - > > Key: SPARK-13601 > URL: https://issues.apache.org/jira/browse/SPARK-13601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > We need to submit another PR against Spark to call the task failure callbacks > before Spark calls the close function on various output streams. > For example, we need to intercept an exception and call > TaskContext.markTaskFailed before calling close in the following code (in > PairRDDFunctions.scala): > {code} > Utils.tryWithSafeFinally { > while (iter.hasNext) { > val record = iter.next() > writer.write(record._1.asInstanceOf[AnyRef], > record._2.asInstanceOf[AnyRef]) > // Update bytes written metric every few records > maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, > recordsWritten) > recordsWritten += 1 > } > } { > writer.close() > } > {code} > Changes to Spark should include unit tests to make sure this always work in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
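The ordering the ticket asks for can be shown in miniature: when the write loop throws, the task-failure callback must run before close(), since close() can itself fail or mask the original error. A sketch with hypothetical names (not Spark's Utils.tryWithSafeFinally):

```python
events = []

class Writer:
    def write(self, record):
        if record == "bad":
            raise ValueError("write failed")

    def close(self):
        events.append("close")

def mark_task_failed(exc):
    # Stand-in for TaskContext.markTaskFailed and its registered callbacks.
    events.append("task_failed")

def write_all(records, writer):
    try:
        for r in records:
            writer.write(r)
    except Exception as e:
        mark_task_failed(e)  # failure callbacks fire first ...
        raise
    finally:
        writer.close()       # ... and only then is the stream closed

try:
    write_all(["ok", "bad"], Writer())
except ValueError:
    pass
```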
[jira] [Updated] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12877: -- Assignee: Jeremy > TrainValidationSplit is missing in pyspark.ml.tuning > > > Key: SPARK-12877 > URL: https://issues.apache.org/jira/browse/SPARK-12877 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk >Assignee: Jeremy > Fix For: 2.0.0 > > > I was investigating progress on SPARK-10759 and I noticed that there is no > TrainValidationSplit class in the pyspark.ml.tuning module. > The Java/Scala examples for SPARK-10759 use > org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from > Python, and this blocks SPARK-10759. > Does the class have a different name in PySpark, maybe? Also, I couldn't find > any JIRA task saying it needs to be implemented. Is it by design that the > TrainValidationSplit estimator is not ported to PySpark? If not, that is, if > the estimator needs porting, then I would like to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11861) Expose feature importances API for decision trees
[ https://issues.apache.org/jira/browse/SPARK-11861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11861: -- Shepherd: Joseph K. Bradley Assignee: Seth Hendrickson Target Version/s: 2.0.0 > Expose feature importances API for decision trees > - > > Key: SPARK-11861 > URL: https://issues.apache.org/jira/browse/SPARK-11861 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > Feature importances should be added to decision trees leveraging the feature > importance implementation for Random Forests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13633: Assignee: Andrew Or (was: Apache Spark) > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13633: Assignee: Apache Spark (was: Andrew Or) > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179033#comment-15179033 ] Apache Spark commented on SPARK-13633: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11506 > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12721: Comment: was deleted (was: A PR is submitted https://github.com/apache/spark/pull/11503) > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179016#comment-15179016 ] Seth Hendrickson edited comment on SPARK-13068 at 3/4/16 12:32 AM: --- I think these are good and valid points. I will give it some more thought. My concern is that the {{expectedType}} approach does not play nice with lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast numpy array to list, but if the numpy array dtype is float and Scala expects ints in the array, there will still be a Py4J classcast exception. Likewise, if someone passes [1,2,3] to an Array[Double] param in Scala, then we will get an exception. To me, it is a bit unsatisfying to provide a solution that works for one very common case, but still fails in other common cases. I'm fine with not adding subclasses of Param for each type, but I think the Param validator functions would provide a comprehensive solution to some of the issues we're seeing. There is another [Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark params working with lists/numpy arrays/vectors so I think addressing this issue in a robust way is important. I would love to hear feedback and thoughts on this or alternative approaches. Thanks! was (Author: sethah): I think these are good and valid points. I will give it some more thought. My concern is that the {{expectedType}} approach does not play nice with lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast numpy array to list, but if the numpy array dtype is float and Scala expects ints in the array, there will still be a Py4J classcast exception. Likewise, if someone passes [1,2,3] to an Array[Double] param in Scala, then we will get an exception. To me, it is a bit unsatisfying to provide a solution that works for one very common case, but still fails in other common cases. 
I'm fine with not adding subclasses of Param for each type, but I think the Param validator functions would provide a comprehensive solution to some of the issues we're seeing. There is another [Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark params working with lists/numpy arrays/vectors so I think addressing this issue in a robust way is important. I would love to hear feedback and others' thoughts and alternative approaches. Thanks! > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params having type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
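One way to picture the validator-function idea from this thread: instead of a single expectedType check, each Param carries a converter that normalizes common Python inputs (ints where doubles are expected, tuples, etc.) before the value crosses into the JVM. A sketch with a hypothetical converter name, not the pyspark.ml API:

```python
def to_list_float(value):
    """Normalize any iterable of numerics to a list of floats, so the
    JVM side of an Array[Double] param always sees doubles."""
    try:
        return [float(v) for v in value]
    except (TypeError, ValueError):
        raise TypeError("could not convert %r to a list of floats" % (value,))

# ints, tuples, and mixed numerics are all accepted and normalized;
# genuinely unconvertible inputs fail fast with a clear Python-side error
# instead of a Py4J ClassCastException.
```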
[jira] [Assigned] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13631: Assignee: (was: Apache Spark) > getPreferredLocations race condition in spark 1.6.0? > > > Key: SPARK-13631 > URL: https://issues.apache.org/jira/browse/SPARK-13631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Andy Sloane > > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside > getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs > instead of DAGScheduler directly. > I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly > flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our > threaded test case, though once in a while it passes. > The stack trace is huge, but starts like this: > Caused by: java.lang.NullPointerException: null > at > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) > at > org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) > at > org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) > The full trace is available here: > https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
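The NullPointerException in the report above is characteristic of a check-then-act race on shared tracker state: one thread clears or replaces a map entry while another thread has already passed the "is it present?" check. A toy sketch (not the MapOutputTrackerMaster code) of the locking discipline that avoids it:

```python
import threading

class Tracker:
    """Holds per-shuffle status; the lock must cover both the lookup
    and the use of the looked-up value, not just the lookup."""

    def __init__(self):
        self._lock = threading.Lock()
        self._statuses = {}

    def put(self, shuffle_id, locations):
        with self._lock:
            self._statuses[shuffle_id] = locations

    def preferred_locations(self, shuffle_id):
        # Check and use happen under one lock acquisition, so a concurrent
        # unregister cannot null the entry out from under us mid-call.
        with self._lock:
            status = self._statuses.get(shuffle_id)
            return [] if status is None else list(status)
```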
[jira] [Assigned] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12721: Assignee: (was: Apache Spark) > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179017#comment-15179017 ] Xiao Li commented on SPARK-12721: - A PR is submitted https://github.com/apache/spark/pull/11503 > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179019#comment-15179019 ] Apache Spark commented on SPARK-13631: -- User 'a1k0n' has created a pull request for this issue: https://github.com/apache/spark/pull/11505 > getPreferredLocations race condition in spark 1.6.0? > > > Key: SPARK-13631 > URL: https://issues.apache.org/jira/browse/SPARK-13631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Andy Sloane > > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside > getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs > instead of DAGScheduler directly. > I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly > flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our > threaded test case, though once in a while it passes. 
> The stack trace is huge, but starts like this: > Caused by: java.lang.NullPointerException: null > at > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) > at > org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) > at > org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) > The full trace is available here: > https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12721: Assignee: Apache Spark > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12721) SQL generation support for script transformation
[ https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179018#comment-15179018 ] Apache Spark commented on SPARK-12721: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11503 > SQL generation support for script transformation > > > Key: SPARK-12721 > URL: https://issues.apache.org/jira/browse/SPARK-12721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > > {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. > Please refer to SPARK-11012 for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13631: Assignee: Apache Spark > getPreferredLocations race condition in spark 1.6.0? > > > Key: SPARK-13631 > URL: https://issues.apache.org/jira/browse/SPARK-13631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Andy Sloane >Assignee: Apache Spark > > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside > getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs > instead of DAGScheduler directly. > I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly > flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our > threaded test case, though once in a while it passes. > The stack trace is huge, but starts like this: > Caused by: java.lang.NullPointerException: null > at > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) > at > org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) > at > org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) > The full trace is available here: > https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179016#comment-15179016 ] Seth Hendrickson commented on SPARK-13068: -- I think these are good and valid points. I will give it some more thought. My concern is that the {{expectedType}} approach does not play nicely with lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast a numpy array to a list, but if the numpy array dtype is float and Scala expects ints in the array, there will still be a Py4J ClassCastException. Likewise, if someone passes [1,2,3] to an Array[Double] param in Scala, then we will get an exception. To me, it is a bit unsatisfying to provide a solution that works for one very common case, but still fails in other common cases. I'm fine with not adding subclasses of Param for each type, but I think the Param validator functions would provide a comprehensive solution to some of the issues we're seeing. There is another [Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark params working with lists/numpy arrays/vectors, so I think addressing this issue in a robust way is important. I would love to hear feedback, others' thoughts, and alternative approaches. Thanks! > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params have type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
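The kind of converter being discussed above could look like the following sketch. This is illustrative only (the function name and behavior are assumptions, not the actual PySpark API): it coerces list-likes, including numpy arrays via duck typing, to plain lists of Python floats, which is what avoids the element-type mismatch that causes the Py4J ClassCastException.

```python
def to_list_float(value):
    """Coerce a list-like of numbers to a plain list of Python floats.

    Casting element types up front avoids the Py4J ClassCastException
    that occurs when e.g. a list of ints reaches an Array[Double] param.
    Hypothetical sketch, not the actual PySpark converter.
    """
    # numpy arrays expose .tolist(); duck-type rather than import numpy
    if hasattr(value, "tolist"):
        value = value.tolist()
    if not isinstance(value, (list, tuple)):
        raise TypeError("expected a list-like value, got %r" % (value,))
    return [float(v) for v in value]
```

Calling `to_list_float([1, 2, 3])` would yield `[1.0, 2.0, 3.0]`, so the JVM side receives doubles regardless of the Python element types.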
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179012#comment-15179012 ] Jeff Zhang commented on SPARK-13587: Thanks for the feedback [~j_houg]. Yes, that's why I make the virtualenv application-scoped. Creating a python package management tool for a cluster would be a big project and too heavy. And what does "--pyspark_python" mean? > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party python packages in > pyspark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has 2 different virtualenv implementations. One is native > virtualenv, another is through conda. This jira is trying to bring these 2 > tools to a distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()
[ https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179011#comment-15179011 ] Apache Spark commented on SPARK-13601: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11504 > Invoke task failure callbacks before calling outputstream.close() > - > > Key: SPARK-13601 > URL: https://issues.apache.org/jira/browse/SPARK-13601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > We need to submit another PR against Spark to call the task failure callbacks > before Spark calls the close function on various output streams. > For example, we need to intercept an exception and call > TaskContext.markTaskFailed before calling close in the following code (in > PairRDDFunctions.scala): > {code} > Utils.tryWithSafeFinally { > while (iter.hasNext) { > val record = iter.next() > writer.write(record._1.asInstanceOf[AnyRef], > record._2.asInstanceOf[AnyRef]) > // Update bytes written metric every few records > maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, > recordsWritten) > recordsWritten += 1 > } > } { > writer.close() > } > {code} > Changes to Spark should include unit tests to make sure this always work in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
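The ordering being proposed above (invoke task-failure callbacks before the stream is closed) can be shown language-agnostically with this sketch. The names `write_all` and `on_task_failure` are illustrative stand-ins, not Spark APIs; `on_task_failure` plays the role of `TaskContext.markTaskFailed`.

```python
def write_all(records, writer, on_task_failure):
    """Write records, invoking the task-failure callback *before* close().

    If close() ran first, it could mask the original error or flush
    partial output before the failure handler gets a chance to run.
    Illustrative sketch of the proposed ordering, not Spark code.
    """
    try:
        for record in records:
            writer.write(record)
    except Exception as exc:
        on_task_failure(exc)  # failure callback fires before close()
        raise
    finally:
        writer.close()
```

With a writer whose `write` raises, the observable call order is failure callback first, then `close()`, which is exactly the guarantee the PR is after.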
[jira] [Created] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore
Evan Chan created SPARK-13662: - Summary: [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore Key: SPARK-13662 URL: https://issues.apache.org/jira/browse/SPARK-13662 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0, 1.5.2 Environment: All Reporter: Evan Chan Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or equivalently the HiveContext.tables method, returns a DataFrame with only two columns: the name of the table and whether it is temporary. It would be really nice to add support to return some extra information, such as: - Whether this table is Spark-only or a native Hive table - If Spark-only, the name of the data source - potentially other properties The first two are really useful for BI environments that connect to multiple data sources and work with both Hive and Spark. Some thoughts: - The SQL/HiveContext Catalog API might need to be expanded to return something like a TableEntry, rather than just a tuple of (name, temporary). - I believe there is a Hive Catalog/client API to get information about each table. I suppose one concern would be the speed of using this API. Perhaps there are other APIs that can get this info faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
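The proposed TableEntry could be sketched as below. All field names here are hypothetical, chosen only to illustrate replacing the (name, temporary) tuple with a richer record; they are not an actual Spark API.

```python
from collections import namedtuple

# A richer catalog entry than the current (name, isTemporary) tuple.
# Field names are illustrative, not an actual Spark API.
TableEntry = namedtuple(
    "TableEntry",
    ["name", "is_temporary", "is_spark_only", "data_source", "properties"],
)

def show_tables(entries):
    """Render entries the way an extended SHOW TABLES result might look."""
    return [
        (e.name, e.is_temporary, e.is_spark_only, e.data_source)
        for e in entries
    ]
```

A BI tool could then filter on `is_spark_only` or `data_source` to distinguish native Hive tables from Spark data-source tables.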
[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?
[ https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178980#comment-15178980 ] Andy Sloane commented on SPARK-13631: - Hm. I see, it's just a coincidence that a thread tends to get scheduled from the thread pool while the shuffle task is running, and tries to schedule a new job based on the locations of the (currently in-process) shuffle tasks. There's no blocking wait on an RDD happening that's causing this job to kick off -- it's just trying to schedule it. It seems like we'd want to defer finding preferred locations on an RDD which is currently being computed, somehow, but the job planning seems to happen completely up front and there aren't any indicators in an individual RDD that it's presently being computed. Which means we are probably unnecessarily recomputing RDDs in this multithreaded scheme. > getPreferredLocations race condition in spark 1.6.0? > > > Key: SPARK-13631 > URL: https://issues.apache.org/jira/browse/SPARK-13631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Andy Sloane > > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside > getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs > instead of DAGScheduler directly. > I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly > flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our > threaded test case, though once in a while it passes. 
> The stack trace is huge, but starts like this: > Caused by: java.lang.NullPointerException: null > at > org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406) > at > org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366) > at > org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545) > The full trace is available here: > https://gist.github.com/andy256/97611f19924bbf65cf49 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178883#comment-15178883 ] Jeff Zhang commented on SPARK-13587: Thanks for letting me know this. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for users to add third-party python packages in > pyspark. > * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has 2 different virtualenv implementations. One is native > virtualenv, another is through conda. This jira is trying to bring these 2 > tools to a distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13659) Remove returnValues from BlockStore APIs
[ https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13659: Assignee: Apache Spark (was: Josh Rosen) > Remove returnValues from BlockStore APIs > > > Key: SPARK-13659 > URL: https://issues.apache.org/jira/browse/SPARK-13659 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Apache Spark > > In preparation for larger refactorings, I think that we should remove the > confusing returnValues() option from the BlockStore put() APIs: returning the > value is only useful in one place (caching) and in other situations, such as > block replication, it's simpler to put() and then get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13659) Remove returnValues from BlockStore APIs
[ https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13659: Assignee: Josh Rosen (was: Apache Spark) > Remove returnValues from BlockStore APIs > > > Key: SPARK-13659 > URL: https://issues.apache.org/jira/browse/SPARK-13659 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > In preparation for larger refactorings, I think that we should remove the > confusing returnValues() option from the BlockStore put() APIs: returning the > value is only useful in one place (caching) and in other situations, such as > block replication, it's simpler to put() and then get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13661) Avoid the copy of UnsafeRow in HashedRelation
Davies Liu created SPARK-13661: -- Summary: Avoid the copy of UnsafeRow in HashedRelation Key: SPARK-13661 URL: https://issues.apache.org/jira/browse/SPARK-13661 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu We usually build the HashedRelation on top of an array of UnsafeRow; the copy could be avoided. The caller of HashedRelation needs to do the copy if it's needed. Another approach could be making the copy() of UnsafeRow smart so that it knows when it should copy the bytes or not; this could also be useful for other components. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
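The caller-side-copy idea can be illustrated with this sketch. It is a toy model, not Spark code: rows arriving from an iterator share a reused buffer (as UnsafeRow iterators do), so a builder that no longer copies unconditionally must let the caller supply a copy function when one is needed.

```python
def build_hashed_relation(rows, key_of, copy_row=None):
    """Group rows by key, leaving the copy decision to the caller.

    If the iterator reuses one underlying buffer for every row (as
    UnsafeRow iterators do), the caller must pass copy_row; when the
    rows are already independent objects, the copy can be skipped.
    Illustrative sketch, not Spark's HashedRelation.
    """
    relation = {}
    for row in rows:
        stored = copy_row(row) if copy_row is not None else row
        relation.setdefault(key_of(stored), []).append(stored)
    return relation
```

With a reused buffer, omitting `copy_row` would leave every stored entry aliasing the final row, which is exactly why the unconditional copy existed; moving the decision to the caller lets safe call sites skip it.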
[jira] [Commented] (SPARK-13659) Remove returnValues from BlockStore APIs
[ https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178869#comment-15178869 ] Apache Spark commented on SPARK-13659: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/11502 > Remove returnValues from BlockStore APIs > > > Key: SPARK-13659 > URL: https://issues.apache.org/jira/browse/SPARK-13659 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > > In preparation for larger refactorings, I think that we should remove the > confusing returnValues() option from the BlockStore put() APIs: returning the > value is only useful in one place (caching) and in other situations, such as > block replication, it's simpler to put() and then get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage
[ https://issues.apache.org/jira/browse/SPARK-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13660: - Labels: starter (was: ) > CommitFailureTestRelationSuite floods the logs with garbage > --- > > Key: SPARK-13660 > URL: https://issues.apache.org/jira/browse/SPARK-13660 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu > Labels: starter > > https://github.com/apache/spark/pull/11439 added a utility method > "testQuietly". We can use it for CommitFailureTestRelationSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage
Shixiong Zhu created SPARK-13660: Summary: CommitFailureTestRelationSuite floods the logs with garbage Key: SPARK-13660 URL: https://issues.apache.org/jira/browse/SPARK-13660 Project: Spark Issue Type: Test Components: Tests Reporter: Shixiong Zhu https://github.com/apache/spark/pull/11439 added a utility method "testQuietly". We can use it for CommitFailureTestRelationSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178867#comment-15178867 ] Jeremy commented on SPARK-12877: Hi Xiangrui, Commenting! > TrainValidationSplit is missing in pyspark.ml.tuning > > > Key: SPARK-12877 > URL: https://issues.apache.org/jira/browse/SPARK-12877 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Wojciech Jurczyk > Fix For: 2.0.0 > > > I was investigating progress in SPARK-10759 and I noticed that there is no > TrainValidationSplit class in the pyspark.ml.tuning module. > The Java/Scala examples for SPARK-10759 use > org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from > Python, and this blocks SPARK-10759. > Does the class have a different name in PySpark, maybe? Also, I couldn't find > any JIRA task saying it needs to be implemented. Is it by design that the > TrainValidationSplit estimator is not ported to PySpark? If not, that is, if > the estimator needs porting, then I would like to contribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage
[ https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-13584. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0 > ContinuousQueryManagerSuite floods the logs with garbage > > > Key: SPARK-13584 > URL: https://issues.apache.org/jira/browse/SPARK-13584 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.0 > > > We should clean up the following outputs > {code} > [info] ContinuousQueryManagerSuite: > 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 > in stage 0.0 (TID 1) > java.lang.ArithmeticException: / by zero > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) > at scala.collection.AbstractIterator.to(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) > at > 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69) > at org.apache.spark.scheduler.Task.run(Task.scala:81) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in > stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at > org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:742) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1194) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > 
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308) > at scala.collection.AbstractIterator.to(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1194) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847) > at
[jira] [Updated] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit
[ https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-13635: --- Assignee: Liang-Chi Hsieh > Enable LimitPushdown optimizer rule because we have whole-stage codegen for > Limit > - > > Key: SPARK-13635 > URL: https://issues.apache.org/jira/browse/SPARK-13635 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.0.0 > > > LimitPushdown optimizer rule has been disabled due to no whole-stage codegen > for Limit. As we have whole-stage codegen for Limit now, we should enable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13657) Can't parse very long AND/OR expression
[ https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13657: Assignee: Davies Liu (was: Apache Spark) > Can't parse very long AND/OR expression > --- > > Key: SPARK-13657 > URL: https://issues.apache.org/jira/browse/SPARK-13657 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it will > fail with a StackOverflowError. > [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13657) Can't parse very long AND/OR expression
[ https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13657: Assignee: Apache Spark (was: Davies Liu) > Can't parse very long AND/OR expression > --- > > Key: SPARK-13657 > URL: https://issues.apache.org/jira/browse/SPARK-13657 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it will > fail with a StackOverflowError. > [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13657) Can't parse very long AND/OR expression
[ https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178862#comment-15178862 ] Apache Spark commented on SPARK-13657: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11501 > Can't parse very long AND/OR expression > --- > > Key: SPARK-13657 > URL: https://issues.apache.org/jira/browse/SPARK-13657 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it will > fail with a StackOverflowError. > [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
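One standard remedy for stack overflows on long AND/OR chains is combining the predicates into a balanced tree instead of a left-deep chain, so recursion depth grows logarithmically rather than linearly. The sketch below illustrates the idea on tuples; it is an assumption about a possible fix, not the approach taken in the actual PR.

```python
def balanced_and(predicates):
    """Combine predicates into a balanced AND tree.

    A left-deep chain over n predicates has depth n, which blows the
    stack during recursive parsing/analysis; a balanced tree has depth
    ~log2(n). Illustrative sketch, not the actual Spark fix.
    """
    if not predicates:
        raise ValueError("need at least one predicate")
    while len(predicates) > 1:
        # Pair up adjacent predicates; an odd leftover passes through.
        predicates = [
            ("AND", predicates[i], predicates[i + 1])
            if i + 1 < len(predicates) else predicates[i]
            for i in range(0, len(predicates), 2)
        ]
    return predicates[0]

def depth(tree):
    """Depth of a nested ("AND", left, right) tuple tree."""
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 1
```

For 1024 predicates the balanced tree is only 11 levels deep, versus 1024 for a left-deep chain.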
[jira] [Created] (SPARK-13659) Remove returnValues from BlockStore APIs
Josh Rosen created SPARK-13659: -- Summary: Remove returnValues from BlockStore APIs Key: SPARK-13659 URL: https://issues.apache.org/jira/browse/SPARK-13659 Project: Spark Issue Type: Improvement Components: Block Manager Reporter: Josh Rosen Assignee: Josh Rosen In preparation for larger refactorings, I think that we should remove the confusing returnValues() option from the BlockStore put() APIs: returning the value is only useful in one place (caching) and in other situations, such as block replication, it's simpler to put() and then get(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
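The put()-then-get() pattern described above can be sketched with a toy store. The names below (ToyBlockStore, PutThenGetDemo) are illustrative stand-ins, not Spark's actual BlockStore API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy stand-in for a block store whose put() no longer returns the value.
class ToyBlockStore {
    private final Map<String, byte[]> blocks = new HashMap<>();

    // After the proposed change: put() only stores; it returns nothing.
    void put(String blockId, byte[] data) {
        blocks.put(blockId, data);
    }

    Optional<byte[]> get(String blockId) {
        return Optional.ofNullable(blocks.get(blockId));
    }
}

public class PutThenGetDemo {
    public static void main(String[] args) {
        ToyBlockStore store = new ToyBlockStore();
        // Replication path: put() alone is enough, no value needed back.
        store.put("rdd_0_0", new byte[] {1, 2, 3});
        // Caching path: the one caller that needs the value does an explicit get().
        byte[] cached = store.get("rdd_0_0").orElseThrow(IllegalStateException::new);
        System.out.println(cached.length); // prints 3
    }
}
```

Dropping the returnValues flag leaves put() with a single responsibility; the rare caller that needs the stored value pays for one extra lookup instead of every caller carrying the option.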
[jira] [Created] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions
Davies Liu created SPARK-13658: -- Summary: BooleanSimplification rule is slow with large boolean expressions Key: SPARK-13658 URL: https://issues.apache.org/jira/browse/SPARK-13658 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu When running TPCDS Q3 [1] with many predicates to filter out the partitions, the optimizer rule BooleanSimplification takes about 2 seconds (it uses semanticEquals heavily, which requires copying the whole tree). It would be great if we could speed it up. [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
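One plausible direction for the speedup, assuming the cost comes from re-walking large subtrees on every equality check: memoize a per-node semantic hash so most unequal comparisons are rejected in O(1). The Expr class and method names below are illustrative only, not Catalyst's actual implementation:

```java
import java.util.Objects;

// Illustrative expression node; not Catalyst's Expression class.
class Expr {
    final String op;
    final Expr left, right;
    private Integer cachedHash; // memoized: computed once per node

    Expr(String op, Expr left, Expr right) {
        this.op = op; this.left = left; this.right = right;
    }

    int semanticHash() {
        if (cachedHash == null) {
            cachedHash = Objects.hash(op,
                left == null ? 0 : left.semanticHash(),
                right == null ? 0 : right.semanticHash());
        }
        return cachedHash;
    }

    // Cheap pre-filter: only walk both trees when the hashes agree.
    boolean semanticEquals(Expr other) {
        if (other == null || semanticHash() != other.semanticHash()) return false;
        return op.equals(other.op)
            && (left == null ? other.left == null : left.semanticEquals(other.left))
            && (right == null ? other.right == null : right.semanticEquals(other.right));
    }
}

public class SemanticEqualsDemo {
    // Build a left-deep chain of n OR nodes, like a long partition filter.
    static Expr chain(int n) {
        Expr e = new Expr("x", null, null);
        for (int i = 0; i < n; i++) {
            e = new Expr("OR", e, new Expr("x", null, null));
        }
        return e;
    }

    public static void main(String[] args) {
        System.out.println(chain(1000).semanticEquals(chain(1000))); // prints true
    }
}
```

With the cached hash, an optimizer rule that compares many candidate subexpressions pairwise does a full recursive walk only for pairs that are very likely equal, instead of for every pair.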
[jira] [Resolved] (SPARK-13632) Create new o.a.s.sql.execution.commands package
[ https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13632. - Resolution: Fixed Fix Version/s: 2.0.0 > Create new o.a.s.sql.execution.commands package > --- > > Key: SPARK-13632 > URL: https://issues.apache.org/jira/browse/SPARK-13632 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies
[ https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13577: Assignee: Apache Spark > Allow YARN to handle multiple jars, archive when uploading Spark dependencies > - > > Key: SPARK-13577 > URL: https://issues.apache.org/jira/browse/SPARK-13577 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See parent bug for more details. > Before we remove assemblies from Spark, we need the YARN backend to > understand how to find and upload multiple jars containing the Spark code. As > a feature request made during spec review, we should also allow the Spark > code to be provided as an archive that would be uploaded as a single file to > the cluster, but exploded when downloaded to the containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies
[ https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13577: Assignee: (was: Apache Spark) > Allow YARN to handle multiple jars, archive when uploading Spark dependencies > - > > Key: SPARK-13577 > URL: https://issues.apache.org/jira/browse/SPARK-13577 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Marcelo Vanzin > > See parent bug for more details. > Before we remove assemblies from Spark, we need the YARN backend to > understand how to find and upload multiple jars containing the Spark code. As > a feature request made during spec review, we should also allow the Spark > code to be provided as an archive that would be uploaded as a single file to > the cluster, but exploded when downloaded to the containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies
[ https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178831#comment-15178831 ] Apache Spark commented on SPARK-13577: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11500 > Allow YARN to handle multiple jars, archive when uploading Spark dependencies > - > > Key: SPARK-13577 > URL: https://issues.apache.org/jira/browse/SPARK-13577 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Marcelo Vanzin > > See parent bug for more details. > Before we remove assemblies from Spark, we need the YARN backend to > understand how to find and upload multiple jars containing the Spark code. As > a feature request made during spec review, we should also allow the Spark > code to be provided as an archive that would be uploaded as a single file to > the cluster, but exploded when downloaded to the containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing
[ https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178803#comment-15178803 ] Joseph K. Bradley commented on SPARK-12347: --- 1. As in the JIRA description, there would need to be some list of special examples requiring input. 2. If an example requires an input file, it should be one of the provided input datasets which comes with Spark, so I think it's OK to hard-code the names. > Write script to run all MLlib examples for testing > -- > > Key: SPARK-12347 > URL: https://issues.apache.org/jira/browse/SPARK-12347 > Project: Spark > Issue Type: Test > Components: ML, MLlib, PySpark, SparkR, Tests >Reporter: Joseph K. Bradley > > It would facilitate testing to have a script which runs all MLlib examples > for all languages. > Design sketch to ensure all examples are run: > * Generate a list of examples to run programmatically (not from a fixed list). > * Use a list of special examples to handle examples which require command > line arguments. > * Make sure data, etc. used are small to keep the tests quick. > This could be broken into subtasks for each language, though it would be nice > to provide a single script. > Not sure where the script should live; perhaps in {{bin/}}? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists
[ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178794#comment-15178794 ] Joseph K. Bradley commented on SPARK-13068: --- I like the idea of automatic casting when possible. A few comments: * I don't think we should add special subclasses of Param for each type. We had to do that in Scala to make it Java-friendly. Instead, the casting could be handled by a single method which takes the value and the expected type. * I don't want to go overboard on conversions. E.g., if a {{str}} is required, I don't think we should try casting to a string. (If a user calls {{setInputCol([1,2,3])}}, then they are probably making a mistake.) * I don't think we should add ParamValidators in Python at this time. We could consider doing it later, but it's really a separate issue. I'm also worried about having validation in 2 places since they could get out of synch. How does that sound? > Extend pyspark ml paramtype conversion to support lists > --- > > Key: SPARK-13068 > URL: https://issues.apache.org/jira/browse/SPARK-13068 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > > In SPARK-7675 we added type conversion for PySpark ML params. We should > follow up and support param type conversion for lists and nested structures > as required. This blocks having all PySpark ML params having type information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13599) Groovy-all ends up in spark-assembly if hive profile set
[ https://issues.apache.org/jira/browse/SPARK-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13599. - Resolution: Fixed Assignee: Steve Loughran Fix Version/s: 2.0.0 > Groovy-all ends up in spark-assembly if hive profile set > > > Key: SPARK-13599 > URL: https://issues.apache.org/jira/browse/SPARK-13599 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.5.0, 1.6.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 2.0.0 > > > If you do a build with {{-Phive,hive-thriftserver}} then the contents of > {{org.codehaus.groovy:groovy-all}} get into the spark-assembly.jar > This is bad because > * it makes the JAR bigger > * it makes the build longer > * it's an uber-JAR itself, so can include things (maybe even conflicting > things) > * it's something else that needs to be kept up to date security-wise -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13657) Can't parse very long AND/OR expression
Davies Liu created SPARK-13657: -- Summary: Can't parse very long AND/OR expression Key: SPARK-13657 URL: https://issues.apache.org/jira/browse/SPARK-13657 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu CatalystQl can't parse an expression with hundreds of AND/OR operators [1]; it fails with a StackOverflowError. [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
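The failure mode can be reproduced with any right-recursive descent parser, since each chained operator consumes a stack frame. The sketch below is a toy parser, not CatalystQl itself, and the operator count is simply chosen to exceed a default JVM thread stack:

```java
import java.util.Arrays;
import java.util.List;

// Toy right-recursive parser for "lit (OR lit)*"; recursion depth grows
// linearly with the number of OR operators, so a long enough chain
// exhausts the thread stack -- the same failure mode as in this ticket.
public class DeepOrParseDemo {
    private final List<String> tokens;
    private int pos = 0;

    DeepOrParseDemo(List<String> tokens) { this.tokens = tokens; }

    // expr := literal ("OR" expr)?
    boolean parseExpr() {
        boolean left = Boolean.parseBoolean(tokens.get(pos++));
        if (pos < tokens.size() && "OR".equals(tokens.get(pos))) {
            pos++;
            return parseExpr() || left; // one stack frame per operator
        }
        return left;
    }

    static String run(int operators) {
        StringBuilder sb = new StringBuilder("true");
        for (int i = 0; i < operators; i++) sb.append(" OR false");
        List<String> tokens = Arrays.asList(sb.toString().split(" "));
        try {
            new DeepOrParseDemo(tokens).parseExpr();
            return "parsed";
        } catch (StackOverflowError e) {
            // Fix: consume same-operator chains with a loop instead of recursion.
            return "StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println(run(200_000)); // deep chain overflows the stack
    }
}
```

Rewriting the same-operator case iteratively (fold the chain in a loop, or left-recurse with an explicit stack) keeps depth constant regardless of how many AND/ORs the query contains.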
[jira] [Updated] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13652: - Summary: TransportClient.sendRpcSync returns wrong results (was: spark netty network issue) > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread-safe; if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code, and it > prints wrong messages.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>         .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It will print output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
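For intuition, here is a hypothetical single-threaded re-enactment of why sharing an unsynchronized connection misroutes replies (a toy model, not the actual Netty pipeline): when replies are consumed in arrival order with no request/response correlation, a caller that reads at the wrong moment receives another caller's reply, matching the off-by-one pattern in the output above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy "connection" to an echo server: replies come back in request order,
// and receiveNext() hands out whatever reply sits at the head of the queue.
public class SharedConnectionDemo {
    private final Queue<Long> replies = new ArrayDeque<>();

    void send(long request) { replies.add(request); }  // echo server replies with the request
    long receiveNext() { return replies.remove(); }    // FIFO; no correlation id

    public static void main(String[] args) {
        SharedConnectionDemo conn = new SharedConnectionDemo();
        conn.send(3L);                  // caller A sends 3...
        conn.send(4L);                  // ...but caller B sends 4 before A reads
        long bGot = conn.receiveNext(); // B reads first and steals A's reply
        long aGot = conn.receiveNext(); // A is left with B's reply
        System.out.println("B sent 4 but got " + bGot + "; A sent 3 but got " + aGot);
    }
}
```

Production RPC layers typically avoid this by framing each message and tagging requests with a unique id, dispatching each reply to the callback registered under that id, so interleaved senders cannot consume each other's responses.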
[jira] [Commented] (SPARK-13652) spark netty network issue
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178723#comment-15178723 ] Shixiong Zhu commented on SPARK-13652: -- Updated "Affects Version/s". This bug only exists in 1.6.0 and master > spark netty network issue > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13652) spark netty network issue
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13652: - Affects Version/s: (was: 1.5.2) (was: 1.5.1) (was: 1.4.1) (was: 1.3.1) (was: 1.2.2) > spark netty network issue > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13652) spark netty network issue
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-13652: - Affects Version/s: 1.2.2 1.3.1 1.4.1 > spark netty network issue > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.5.2, 1.6.0 >Reporter: huangyu > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread safe and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org