Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
Nice to hear that your experiment is consistent with my assumption. The current L1/L2 regularization penalizes the intercept as well, which is not ideal. I'm working on GLMNET in Spark using OWLQN, and I can get exactly the same solution as R, but with scalability in # of rows and columns. Stay tuned! Sincerely,
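
As a minimal sketch of the distinction in plain Scala (illustrative only, not the actual MLlib or GLMNET code; it assumes the intercept is stored as the last element of the weight vector):

    // Sum the L2 penalty over the feature weights only, leaving the
    // intercept unpenalized. Assumes the intercept is the last element.
    def l2PenaltyExcludingIntercept(weights: Array[Double], lambda: Double): Double = {
      val featureWeights = weights.dropRight(1) // drop the intercept term
      0.5 * lambda * featureWeights.map(w => w * w).sum
    }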

Trouble running tests

2014-10-09 Thread Yana
Hi, apologies if I missed a FAQ somewhere. I am trying to submit a bug fix for the very first time. Following the instructions, I forked the git repo (at c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run the tests. I run this: ./dev/run-tests _SQL_TESTS_ONLY=true and after a while get the

Introduction to Spark Blog

2014-10-09 Thread devl.development
Hi Spark community, Having spent some time getting up to speed with the various Spark components in the core package, I've written a blog post to help other newcomers and contributors. By no means am I a Spark expert, so I would be grateful for any advice, comments or edit suggestions. Thanks very much

Re: Trouble running tests

2014-10-09 Thread Nicholas Chammas
_RUN_SQL_TESTS needs to be true as well. Those two _... variables get set correctly when tests are run on Jenkins; they're not meant to be manipulated directly by testers. Did you want to run only the SQL tests locally? You can try faking being Jenkins by setting AMPLAB_JENKINS=true before calling
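
Putting those variables together, the invocation would presumably look something like the following (untested; since the script normally sets these variables itself, the exact behavior may differ):

    AMPLAB_JENKINS=true _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true ./dev/run-tests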

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
For performance, will foreign data formats be supported the same as native ones? Thanks, James On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote: The foreign data source API PR also matters here https://www.github.com/apache/spark/pull/2475 Foreign data sources like ORC can be

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread Michael Armbrust
Yes, the foreign sources work is only about exposing a stable set of APIs for external libraries to link against (to avoid the spark assembly becoming a dependency mess). The code path these APIs use will be the same as that for datasources included in the core spark sql library. Michael On
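
As a loose illustration of what linking against such a stable API might look like, here is a hypothetical Scala trait; the names are invented for illustration and are not the actual interface from the PR:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Hypothetical interface an external format (e.g. ORC) might implement;
    // Spark SQL would invoke buildScan() through the stable API instead of
    // linking against the library's internals.
    trait ExternalRelation {
      def schemaDescription: Seq[(String, String)] // (column name, type name), simplified
      def buildScan(): RDD[Row]                    // rows produced by scanning the source
    }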

Re: Trouble running tests

2014-10-09 Thread Michael Armbrust
Also, in general for SQL-only changes it is sufficient to run sbt/sbt catalyst/test sql/test hive/test. The hive/test part takes the longest, so I usually leave that out until just before submitting, unless my changes are Hive-specific. On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas
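
For even faster iteration, sbt's standard test-only task can narrow a run to a single suite, for example (suite name chosen as an illustration; newer sbt versions spell the task testOnly):

    sbt/sbt "sql/test-only org.apache.spark.sql.SQLQuerySuite"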

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Thanks for the feedback. For 1, there is an open patch: https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but they're faster to access otherwise. Matei On Oct 9, 2014, at 12:11 PM,

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Oops, I forgot to add: for 2, maybe we can add a flag to use DISK_ONLY for TorrentBroadcast, or use it automatically if the broadcasts are bigger than some size. Matei On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks for the feedback. For 1, there is an open patch:
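
A rough sketch of what such a flag might look like; both configuration keys here are invented for illustration and are not real Spark settings:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // Pick the storage level for broadcast blocks: DISK_ONLY when the
    // (hypothetical) flag is set or the block exceeds a size threshold,
    // otherwise the current MEMORY_AND_DISK behavior.
    def broadcastStorageLevel(conf: SparkConf, sizeBytes: Long): StorageLevel = {
      val forceDisk = conf.getBoolean("spark.broadcast.diskOnly", false)               // hypothetical key
      val threshold = conf.getLong("spark.broadcast.diskOnlyThreshold", Long.MaxValue) // hypothetical key
      if (forceDisk || sizeBytes >= threshold) StorageLevel.DISK_ONLY
      else StorageLevel.MEMORY_AND_DISK
    }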

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
Sounds great, thanks! On Thu, Oct 9, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com wrote: Yes, the foreign sources work is only about exposing a stable set of APIs for external libraries to link against (to avoid the spark assembly becoming a dependency mess). The code path

spark-prs and mesos/spark-ec2

2014-10-09 Thread Nicholas Chammas
Does it make sense to point the Spark PR review board to read from mesos/spark-ec2 as well? PRs submitted against that repo may reference Spark JIRAs and need review just like any other Spark PR. Nick

[Spark SQL] Strange NPE in Spark SQL with Hive

2014-10-09 Thread Trident
Hi Community, I use Spark 1.0.2, using Spark SQL to do Hive SQL. When I run the following code in the Spark shell: val file = sc.textFile("./README.md"); val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _); count.collect() Correct and no error!

[Spark SQL Continue] Sorry, it is not only limited in SQL, may due to network

2014-10-09 Thread Trident
Dear Community, Please ignore my last post about Spark SQL. When I run: val file = sc.textFile("./README.md"); val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _); count.collect() it happens too. Is there any possible reason for

Re: Spark on Mesos 0.20

2014-10-09 Thread Fairiz Azizi
Hello, Sorry for the late reply. When I tried the LogQuery example this time, things now seem to be fine! ... 14/10/10 04:01:21 INFO scheduler.DAGScheduler: Stage 0 (collect at LogQuery.scala:80) finished in 0.429 s 14/10/10 04:01:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,