Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.
Nice to hear that your experiment is consistent with my assumption. The current L1/L2 regularization penalizes the intercept as well, which is not ideal. I'm working on GLMNET in Spark using OWLQN, and I can get exactly the same solution as R, but with scalability in the number of rows and columns. Stay tuned!

Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Mon, Sep 29, 2014 at 11:45 AM, Yanbo Liang yanboha...@gmail.com wrote:

Thank you for all your patient responses. I can conclude that if the data is totally separable or over-fitting occurs, the weights may differ. This is also consistent with my experiments. I evaluated two different datasets, with the following setup and results:

Loss function: LogisticGradient
Regularizer: L2
regParam: 1.0
numIterations: 1 (SGD)

Dataset 1: spark-1.1.0/data/mllib/sample_binary_classification_data.txt
# of classes: 2
# of samples: 100
# of features: 692
The areaUnderROC of both SGD and LBFGS reaches nearly 1.0.
The loss of both optimization methods converges to nearly 1.7147811767900675E-5 (very, very small).
The weights from the two optimization methods are different, but look roughly like constant multiples of each other (not strictly), just as DB Tsai mentioned above. The dataset is probably totally separable.

Dataset 2: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#german.numer
# of classes: 2
# of samples: 1000
# of features: 24
The areaUnderROC of both SGD and LBFGS is nearly 0.8.
The loss of both optimization methods converges to nearly 0.5367041390107519.
The weights from the two optimization methods are the same.

2014-09-29 16:05 GMT+08:00 DB Tsai dbt...@dbtsai.com:

Can you check the loss of both the LBFGS and SGD implementations? One reason may be that SGD doesn't converge well; you can see that by comparing the log-likelihoods. Another potential reason may be that the labels of your training data are totally separable, so you can always increase the log-likelihood by multiplying the weights by a constant.

Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com wrote:

Hi, we have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and test dataset, with LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the training methods and all other parameters the same. The precision of both methods is nearly 100% and the AUCs are also near 1.0. As far as I know, a convex optimization problem converges to the global minimum. (We use SGD with a mini-batch fraction of 1.0.) But I got two different weight vectors. Is this expected?
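For readers who want to reproduce the comparison, a minimal sketch from the spark-shell, assuming the Spark 1.1 MLlib APIs (the iteration count is illustrative, and sc is the shell's SparkContext):

    import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc,
      "data/mllib/sample_binary_classification_data.txt").cache()

    // Train with both optimizers on the same data.
    val sgdModel = LogisticRegressionWithSGD.train(data, 100)
    val lbfgsModel = new LogisticRegressionWithLBFGS().run(data)

    // On a totally separable dataset the two weight vectors can differ,
    // roughly by a constant factor, even though both reach near-zero loss.
    val ratios = sgdModel.weights.toArray.zip(lbfgsModel.weights.toArray)
      .collect { case (s, l) if math.abs(l) > 1e-6 => s / l }
    println(ratios.take(10).mkString(", "))

If the printed ratios cluster around a single value, that is the "constant multiple" relationship discussed above.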
Trouble running tests
Hi, apologies if I missed a FAQ somewhere. I am trying to submit a bug fix for the very first time. Following the instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests. I run this:

./dev/run-tests _SQL_TESTS_ONLY=true

and after a while get the following error:

[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error]                   ^

(I am running this without my changes.) I have two questions:
1. How do I fix this?
2. Is there a best practice for what to fork so you start from a known-good state? I'm wondering if I should sync the latest changes or go back to a tag.

Thanks in advance.
Introduction to Spark Blog
Hi Spark community, Having spent some time getting up to speed with the various Spark components in the core package, I've written a blog post to help other newcomers and contributors. I'm by no means a Spark expert, so I would be grateful for any advice, comments, or edit suggestions. Thanks very much; here's the post: http://batchinsights.wordpress.com/2014/10/09/a-short-dive-into-apache-spark/ Dev
Re: Trouble running tests
_RUN_SQL_TESTS needs to be true as well. Those two _... variables get set correctly when tests are run on Jenkins; they're not meant to be manipulated directly by testers. Did you want to run only the SQL tests locally? You can try faking being Jenkins by setting AMPLAB_JENKINS=true before calling run-tests (i.e., AMPLAB_JENKINS=true ./dev/run-tests). That should be simpler than futzing with the _... variables. Nick

On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote: [...]
Re: will/when Spark/SparkSQL will support ORCFile format
Performance-wise, will foreign data formats be supported the same as native ones? Thanks, James

On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote: The foreign data source API PR also matters here: https://www.github.com/apache/spark/pull/2475 Foreign data sources like ORC can be added more easily and systematically after this PR is merged.

On 10/9/14 8:22 AM, James Yu wrote: Thanks Mark! I will keep an eye on it. @Evan, I saw people use both formats, so I really want to have Spark support ORCFile.

On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com wrote: https://github.com/apache/spark/pull/2576

On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com wrote: James, Michael at the meetup last night said there was some development activity around ORCFiles. I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote: I didn't see anyone ask this question before, but I was wondering if anyone knows whether Spark/SparkSQL will support the ORCFile format soon? ORCFile is getting more and more popular in the Hive world. Thanks, James
Re: will/when Spark/SparkSQL will support ORCFile format
Yes, the foreign sources work is only about exposing a stable set of APIs for external libraries to link against (to avoid the Spark assembly becoming a dependency mess). The code paths these APIs use will be the same as those for data sources included in the core Spark SQL library. Michael

On Thu, Oct 9, 2014 at 2:18 PM, James Yu jym2...@gmail.com wrote: [...]
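For readers wondering what a foreign data source looks like against these APIs, here is a minimal sketch using the sources interfaces as they later shipped (Spark 1.3-era BaseRelation/TableScan/RelationProvider signatures; the PR under discussion was still in flight at the time, so details differed, and the class and column names below are invented for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Spark looks up a class named DefaultSource in the provider's package.
    class DefaultSource extends RelationProvider {
      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation =
        TextLineRelation(parameters("path"))(sqlContext)
    }

    // Exposes a text file as a single-column table.
    case class TextLineRelation(path: String)(@transient val sqlContext: SQLContext)
      extends BaseRelation with TableScan {

      override def schema: StructType =
        StructType(StructField("line", StringType, nullable = false) :: Nil)

      // Each text line becomes one single-column Row.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.textFile(path).map(Row(_))
    }

The point Michael makes holds here: a relation like this plugs into the same scan planning as the built-in data sources.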
Re: Trouble running tests
Also, in general, for SQL-only changes it is sufficient to run sbt/sbt catalyst/test sql/test hive/test. The hive/test part takes the longest, so I usually leave it out until just before submitting, unless my changes are Hive-specific.

On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [...]
Re: TorrentBroadcast slow performance
Thanks for the feedback. For 1, there is an open patch: https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but they're faster to access otherwise. Matei

On Oct 9, 2014, at 12:11 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote:

Hi, thanks to your answer, we've found the problem. It was reverse IP resolution on the drivers we used (wrong configuration of the local bind9). Apparently, not being able to reverse-resolve the IP addresses of the nodes was the culprit behind the 10 s delay. We've hit two other, secondary problems with TorrentBroadcast though, in case you're interested:

1 - Broadcasting a variable of about 2 GB (1.8 GB exactly) triggers a java.lang.OutOfMemoryError: Requested array size exceeds VM limit, which is not the case with HttpBroadcast (I guess HttpBroadcast splits the serialized variable into small chunks).
2 - Memory use of Torrent seems to be higher than Http (i.e., switching from Http to Torrent triggers several OOMs).

Additionally, a question: while HttpBroadcast stores the broadcast pieces on disk (in spark.local.dir/spark-...), TorrentBroadcast seems not to use a disk backend for storage. Does this mean that HttpBroadcast can handle broadcasts bigger than memory? If so, it's too bad that this design choice wasn't used for Torrent. That being said, hats off to the people in charge of broadcast unloading with respect to the lineage; this stuff works great! Guillaume

Maybe there is a firewall issue that makes it slow for your nodes to connect through the IP addresses they're configured with. I see there's this 10 second pause between "Updated info of block broadcast_84_piece1" and "ensureFreeSpace(4194304) called" (where it actually receives the block). HTTP broadcast used only HTTP fetches from the executors to the driver, but TorrentBroadcast has connections between the executors themselves, and between executors and the driver, over a different port. Where are you running your driver app and nodes? Matei

On Oct 7, 2014, at 11:42 AM, Davies Liu dav...@databricks.com wrote: Could you create a JIRA for it? Maybe it's a regression after https://issues.apache.org/jira/browse/SPARK-3119. We would appreciate it if you could tell us how to reproduce it.

On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi, I've had no answer to this on u...@spark.apache.org, so I'm posting it on dev before filing a JIRA (in case the problem or solution is already identified). We've had some performance issues since switching to 1.1.0, and we finally found the origin: TorrentBroadcast seems to be very slow in our setting (and it became the default with 1.1.0).

The logs of a 4 MB variable with TorrentBroadcast (15 s):

14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 stored as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block broadcast_84_piece1
14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called with curMem=1401611984, maxMem=9168696115
14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 stored as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_84_piece0
14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast variable 84 took 15.202260006 s
14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called with curMem=1405806288, maxMem=9168696115
14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as values in memory (estimated size 4.2 MB, free 7.2 GB)

(Notice that a 10 s lag happens after the "Updated info of block broadcast_..." line and before the MemoryStore log.)

And with HttpBroadcast (0.3 s):

14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast variable 147
14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called with curMem=1373493232, maxMem=9168696115
14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as values in memory (estimated size 4.2 MB, free 7.3 GB)
14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable 147 took 0.320907112 s
14/10/01 16:05:58 INFO storage.BlockManager: Found block broadcast_147 locally

Since Torrent is supposed to perform much better than Http, we suspect a configuration error on our side, but are unable to pin it down. Does someone have any idea of the origin of the problem? For now we're sticking with the HttpBroadcast workaround. Guillaume

-- Guillaume PITEL, Président +33(0)626 222 431 eXenSa S.A.S. 41, rue Périer - 92120 Montrouge - FRANCE Tel +33(0)184 163 677 / Fax +33(0)972 283 705
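Since the thread ends on the HttpBroadcast workaround, here is a minimal sketch of how to select it, assuming the documented Spark 1.x spark.broadcast.factory setting (the app name and broadcast payload are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Fall back from TorrentBroadcast (the 1.1.0 default) to HttpBroadcast.
    val conf = new SparkConf()
      .setAppName("broadcast-workaround")
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.HttpBroadcastFactory")
    val sc = new SparkContext(conf)

    // Broadcast a lookup table and read it from the executors.
    val table = sc.broadcast((1 to 1000).map(i => i -> i.toString).toMap)
    sc.parallelize(1 to 10).map(i => table.value(i)).count()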
Re: TorrentBroadcast slow performance
Oops, I forgot to add: for 2, maybe we can add a flag to use DISK_ONLY for TorrentBroadcast, or use it when the broadcasts are bigger than some size. Matei

On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: [...]
Re: will/when Spark/SparkSQL will support ORCFile format
Sounds great, thanks! On Thu, Oct 9, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com wrote: [...]
spark-prs and mesos/spark-ec2
Does it make sense to point the Spark PR review board to read from mesos/spark-ec2 as well? PRs submitted against that repo may reference Spark JIRAs and need review just like any other Spark PR. Nick
[Spark SQL] Strange NPE in Spark SQL with Hive
Hi Community, I use Spark 1.0.2, using Spark SQL to run Hive SQL. When I run the following code in the Spark shell:

val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
count.collect()

Correct and no error! When I run the following code:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SHOW TABLES").collect().foreach(println)

Correct and no error! But when I run:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SELECT COUNT(*) from uservisits").collect().foreach(println)

it comes back with some error messages. What I found was the following error:

14/10/09 19:47:34 ERROR Executor: Exception in task ID 4
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 5
14/10/09 19:47:34 INFO Executor: Running task ID 5
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Sending request for 2 blocks (2.5 KB) from node19:50868
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_0_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 5, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_1_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 6, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Buffer list:
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 2 ms
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Got local blocks in 0 ms ms
14/10/09 19:47:34 ERROR Executor: Exception in task ID 5
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 6
14/10/09 19:47:34 INFO Executor: Running task ID 6
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
[Spark SQL Continued] Sorry, it is not limited to SQL; it may be due to the network
Dear Community, Please ignore my last post about Spark SQL. When I run:

val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
count.collect()

it happens too. Is there any possible reason for that? We made some adjustments to the network last night. Chen Weikeng

14/10/09 20:45:23 ERROR Executor: Exception in task ID 1
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
14/10/09 20:45:23 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/10/09 20:45:23 INFO Executor: Running task ID 2
14/10/09 20:45:23 DEBUG BlockManager: Getting local block broadcast_0
14/10/09 20:45:23 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(true, true, false, true, 1)
14/10/09 20:45:23 DEBUG BlockManager: Getting block broadcast_0 from memory
14/10/09 20:45:23 INFO BlockManager: Found block broadcast_0 locally
14/10/09 20:45:23 DEBUG Executor: Task 2's epoch is 0
14/10/09 20:45:23 INFO HadoopRDD: Input split: file:/public/rdma14/app/spark-rdma/examples/src/main/resources/people.txt:16+16
14/10/09 20:45:23 ERROR Executor: Exception in task ID 2
java.lang.NullPointerException
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Re: Spark on Mesos 0.20
Hello, sorry for the late reply. When I tried the LogQuery example this time, things now seem to be fine!

...
14/10/10 04:01:21 INFO scheduler.DAGScheduler: Stage 0 (collect at LogQuery.scala:80) finished in 0.429 s
14/10/10 04:01:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool defa
14/10/10 04:01:21 INFO spark.SparkContext: Job finished: collect at LogQuery.scala:80, took 12.802743914 s
(10.10.10.10,FRED,GET http://images.com/2013/Generic.jpg HTTP/1.1) bytes=621 n=2

Not sure if this is the correct output for that example. Our Mesos/Spark builds have been updated since I last wrote; possibly the JDK version was updated to 1.7.0_67. If you are using an older JDK, maybe try updating that? - Fi

On Wed, Oct 8, 2014 at 7:54 AM, RJ Nowling rnowl...@gmail.com wrote: Yep! That's the example I was talking about. Is an error message printed when it hangs? I get:

14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140930-131734-1723727882-5050-1895-1

On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi code...@gmail.com wrote: Sure, could you point me to the example? The only thing I could find was https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LogQuery.scala So do you mean running it like: MASTER=mesos://xxx:5050 ./run-example LogQuery? I tried that, and I can see the job run and the tasks complete on the slave nodes, but the client process seems to hang forever; it's probably a different problem. BTW, only a dozen or so tasks kick off. I actually haven't done much with Scala and Spark (it's been all Python). Fi

On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling rnowl...@gmail.com wrote: I was able to reproduce it on a small 4-node cluster (1 Mesos master and 3 Mesos slaves) with relatively low-end specs. As I said, I just ran the log query examples with the fine-grained Mesos mode. Spark 1.1.0 and Mesos 0.20.1. Fairiz, could you try running the LogQuery example included with Spark and see what you get? Thanks!

On Mon, Oct 6, 2014 at 8:07 PM, Fairiz Azizi code...@gmail.com wrote: That's what's great about Spark, the community is so active! :) I compiled Mesos 0.20.1 from the source tarball and am using the MapR3 Spark 1.1.0 distribution from the Spark downloads page (spark-1.1.0-bin-mapr3.tgz). I see no problems for the workloads we are trying. However, the cluster is small (fewer than 100 cores across 3 nodes), and the workload reads in just a few gigabytes from HDFS via an IPython notebook Spark shell. Thanks, Fi

On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen tnac...@gmail.com wrote: OK, I created SPARK-3817 to track this; will try to repro it as well. Tim

On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling rnowl...@gmail.com wrote: I've recently run into this issue as well. I get it from running Spark examples such as log query. Maybe that'll help reproduce the issue.

On Monday, October 6, 2014, Gurvinder Singh gurvinder.si...@uninett.no wrote: The issue does not occur if the job at hand has a small number of map tasks. I have a job with 978 map tasks, and I see this error:

14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140711-081617-711206558-5050-2543-5

Here is the log from the mesos-slave where this container was running: http://pastebin.com/Q1Cuzm6Q If you look at the Spark code where the error is produced, you will see that it simply exits, with a comment saying this should never happen, let's just quit :-) - Gurvinder

On 10/06/2014 09:30 AM, Timothy Chen wrote: (Hit enter too soon...) What is your setup, and what are the steps to repro this? Tim

On Mon, Oct 6, 2014 at 12:30 AM, Timothy Chen tnac...@gmail.com wrote: Hi Gurvinder, I tried fine-grained mode before and didn't get into that problem.

On Sun, Oct 5, 2014 at 11:44 PM, Gurvinder Singh gurvinder.si...@uninett.no wrote: On 10/06/2014 08:19 AM, Fairiz Azizi wrote: The Spark online docs indicate that Spark is compatible with Mesos 0.18.1. I've gotten it to work just fine on 0.18.1 and 0.18.2. Has anyone tried Spark on a newer version of Mesos, i.e. Mesos 0.20.0? -Fi

Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in coarse-grained mode; in fine-grained mode there is an issue with block manager name conflicts. I have been waiting for it to be fixed, but it is still there. -Gurvinder
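Since the thread's workaround is coarse-grained mode, here is a minimal sketch of selecting it, assuming the standard spark.mesos.coarse property from the Spark 1.x Mesos docs (the master address and app name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // The duplicate block manager registrations reported above show up in
    // fine-grained mode; spark.mesos.coarse=true switches to coarse-grained
    // mode, where each Spark executor holds onto its Mesos resources.
    val conf = new SparkConf()
      .setMaster("mesos://master-host:5050")  // hypothetical master address
      .setAppName("logquery-on-mesos")
      .set("spark.mesos.coarse", "true")
    val sc = new SparkContext(conf)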