Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
Nice to hear that your experiment is consistent with my assumption. The
current L1/L2 regularization penalizes the intercept as well, which is not
ideal. I'm working on GLMNET in Spark using OWLQN, and I can get exactly the
same solution as R, but with scalability in the number of rows and columns.
Stay tuned!
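
For reference, the elastic-net objective that glmnet solves leaves the
intercept b out of the penalty; written out (a standard formulation, stated
here for context, not quoted from the package):

  \min_{w,b} \ \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (w^\top x_i + b)\right)\right) + \lambda \left( \alpha \|w\|_1 + \frac{1-\alpha}{2} \|w\|_2^2 \right)

Penalizing b as well shrinks the intercept toward zero, which is why doing so
is not ideal.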

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Sep 29, 2014 at 11:45 AM, Yanbo Liang yanboha...@gmail.com wrote:
 Thank you for all your patient response.

 I can conclude that if the data is totally separable or over-fitting occurs,
 the weights may differ.
 This is also consistent with my experiment.

 I have evaluated two different datasets; the results are as follows:
 Loss function: LogisticGradient
 Regularizer: L2
 regParam: 1.0
 numIterations: 1 (SGD)

 Dataset 1: spark-1.1.0/data/mllib/sample_binary_classification_data.txt
 # of classes: 2
 # of samples: 100
 # of features: 692
 areaUnderROC of both SGD and LBFGS can reach nearly 1.0.
 The loss functions of both optimization methods converge to nearly
 1.7147811767900675E-5 (very, very small).
 The weights from the two optimization methods are different, but they look
 roughly like scalar multiples of each other (not strictly), just as DB Tsai
 mentioned above. It might be that the dataset is totally separable.

 Dataset 2:
 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#german.numer
 # of classes: 2
 # of samples: 1000
 # of features: 24
 areaUnderROC of both SGD and LBFGS is nearly 0.8.
 The loss functions of both optimization methods converge to nearly 0.5367041390107519.
 The weights from the two optimization methods are practically the same.
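
 For reference, here is a minimal sketch of the kind of comparison I ran
 (parameter values are illustrative and this follows the 1.1.0 MLlib API;
 treat it as a sketch rather than the exact script):

 import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
 import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
 import org.apache.spark.mllib.optimization.SquaredL2Updater
 import org.apache.spark.mllib.util.MLUtils

 val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt").cache()

 // SGD with L2 regularization (mini-batch fraction 1.0, i.e. full-batch gradient descent).
 val sgd = new LogisticRegressionWithSGD()
 sgd.optimizer.setNumIterations(100).setMiniBatchFraction(1.0).setRegParam(1.0).setUpdater(new SquaredL2Updater)
 val sgdModel = sgd.run(data)

 // LBFGS with the same L2 regularizer.
 val lbfgs = new LogisticRegressionWithLBFGS()
 lbfgs.optimizer.setRegParam(1.0).setUpdater(new SquaredL2Updater)
 val lbfgsModel = lbfgs.run(data)

 // AUC of each model (clearThreshold() makes predict() return raw scores).
 def auc(model: org.apache.spark.mllib.classification.LogisticRegressionModel): Double = {
   model.clearThreshold()
   val scoreAndLabel = data.map(p => (model.predict(p.features), p.label))
   new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC()
 }
 println(s"SGD AUC = ${auc(sgdModel)}, LBFGS AUC = ${auc(lbfgsModel)}")

 // Element-wise ratio of the two weight vectors: on separable data it tends to
 // look roughly constant, i.e. one solution is close to a scalar multiple of the other.
 val ratios = sgdModel.weights.toArray.zip(lbfgsModel.weights.toArray)
   .collect { case (a, b) if math.abs(b) > 1e-12 => a / b }
 println(ratios.take(10).mkString(", "))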



 2014-09-29 16:05 GMT+08:00 DB Tsai dbt...@dbtsai.com:

 Can you check the loss of both the LBFGS and SGD implementations? One
 reason may be that SGD doesn't converge well, which you can see by
 comparing the log-likelihoods. Another potential reason may be that the
 labels of your training data are totally separable, so you can always
 increase the log-likelihood by multiplying the weights by a constant
 (for a separating weight vector w, scaling it by c > 1 pushes every
 per-example probability 1 / (1 + exp(-y * c * (w . x))) closer to 1, so the
 unregularized likelihood has no finite maximizer).

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang yanboha...@gmail.com
 wrote:
  Hi
 
  We have used LogisticRegression with two different optimization methods,
  SGD and LBFGS, in MLlib.
  With the same dataset and the same training and test split, we get
  different weight vectors.
 
  For example, we use
  spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our
  training and test dataset,
  with LogisticRegressionWithSGD and LogisticRegressionWithLBFGS as the
  training methods and the same other parameters.
 
  The precisions of these two methods are almost 100% and the AUCs are also
  near 1.0.
  As far as I know, a convex optimization problem will converge to the
  global minimum. (We use SGD with a mini-batch fraction of 1.0.)
  But I got two different weight vectors. Is this expected, or does it make
  sense?






Trouble running tests

2014-10-09 Thread Yana
Hi, apologies if I missed a FAQ somewhere.

I am trying to submit a bug fix for the very first time. Reading
instructions, I forked the git repo (at
c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.

I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true

and after a while get the following error: 

[info] ScalaTest
[info] Run completed in 3 minutes, 37 seconds.
[info] Total number of tests run: 224
[info] Suites: completed 19, aborted 0
[info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
[info] All tests passed.
[info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
[success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
[error] Expected ID character
[error] Not a valid command: hive-thriftserver
[error] Expected project ID
[error] Expected configuration
[error] Expected ':' (if selecting a configuration)
[error] Expected key
[error] Not a valid key: hive-thriftserver
[error] hive-thriftserver/test
[error]  ^


(I am running this without my changes)

I have 2 questions:
1. How to fix this
2. Is there a best practice on what to fork so you start off with a good
state? I'm wondering if I should sync the latest changes or go back to a
label?

thanks in advance




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Introduction to Spark Blog

2014-10-09 Thread devl.development
Hi Spark community

Having spent some time getting up to speed with the various Spark components
in the core package, I've written a blog post to help other newcomers and
contributors.

By no means am I a Spark expert, so I would be grateful for any advice,
comments, or edit suggestions.

Thanks very much. Here's the post:

http://batchinsights.wordpress.com/2014/10/09/a-short-dive-into-apache-spark/

Dev





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Introduction-to-Spark-Blog-tp8718.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Trouble running tests

2014-10-09 Thread Nicholas Chammas
_RUN_SQL_TESTS needs to be true as well. Those two _... variables get set
correctly when tests are run on Jenkins. They’re not meant to be
manipulated directly by testers.

Did you want to run SQL tests only locally? You can try faking being
Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
should be simpler than futzing with the _... variables.

Nick

On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote:

 Hi, apologies if I missed a FAQ somewhere.

 I am trying to submit a bug fix for the very first time. Reading
 instructions, I forked the git repo (at
 c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.

 I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true

 and after a while get the following error:

 [info] ScalaTest
 [info] Run completed in 3 minutes, 37 seconds.
 [info] Total number of tests run: 224
 [info] Suites: completed 19, aborted 0
 [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
 [info] All tests passed.
 [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
 [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
 [error] Expected ID character
 [error] Not a valid command: hive-thriftserver
 [error] Expected project ID
 [error] Expected configuration
 [error] Expected ':' (if selecting a configuration)
 [error] Expected key
 [error] Not a valid key: hive-thriftserver
 [error] hive-thriftserver/test
 [error]  ^


 (I am running this without my changes)

 I have 2 questions:
 1. How to fix this
 2. Is there a best practice on what to fork so you start off with a good
 state? I'm wondering if I should sync the latest changes or go back to a
 label?

 thanks in advance




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
For performance, will foreign data formats be supported the same as native ones?

Thanks,
James


On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote:

 The foreign data source API PR also matters here
 https://www.github.com/apache/spark/pull/2475

 Foreign data source like ORC can be added more easily and systematically
 after this PR is merged.

 On 10/9/14 8:22 AM, James Yu wrote:

 Thanks Mark! I will keep an eye on it.

 @Evan, I saw people use both formats, so I really want to have Spark
 support ORCFile.


 On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com
 wrote:

  https://github.com/apache/spark/pull/2576



 On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com
 wrote:

  James,

 Michael at the meetup last night said there was some development
 activity around ORCFiles.

 I'm curious though, what are the pros and cons of ORCFiles vs Parquet?

 On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote:

 Didn't see anyone ask the question before, but I was wondering if anyone
 knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
 getting more and more popular in the Hive world.

 Thanks,
 James

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org






Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread Michael Armbrust
Yes, the foreign sources work is only about exposing a stable set of APIs
for external libraries to link against (to avoid the spark assembly
becoming a dependency mess).  The code path these APIs use will be the same
as that for datasources included in the core spark sql library.
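
In the meantime, ORC-backed Hive tables can generally already be queried
through the HiveContext code path, since that path goes through Hive's own
SerDes and input formats. A rough, untested sketch (table and column names
are made up):

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  hiveContext.hql("CREATE TABLE IF NOT EXISTS visits_orc (ip STRING, bytes INT) STORED AS ORC")
  hiveContext.hql("SELECT COUNT(*) FROM visits_orc").collect().foreach(println)

That only covers going through Hive; first-class support in Spark SQL itself
is what the work discussed in this thread is about.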

Michael

On Thu, Oct 9, 2014 at 2:18 PM, James Yu jym2...@gmail.com wrote:

 For performance, will foreign data formats be supported the same as native ones?

 Thanks,
 James


 On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com wrote:

  The foreign data source API PR also matters here
  https://www.github.com/apache/spark/pull/2475
 
  Foreign data source like ORC can be added more easily and systematically
  after this PR is merged.
 
  On 10/9/14 8:22 AM, James Yu wrote:
 
  Thanks Mark! I will keep an eye on it.

  @Evan, I saw people use both formats, so I really want to have Spark
  support ORCFile.
 
 
  On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com
  wrote:
 
   https://github.com/apache/spark/pull/2576
 
 
 
  On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com
  wrote:
 
   James,
 
  Michael at the meetup last night said there was some development
  activity around ORCFiles.
 
  I'm curious though, what are the pros and cons of ORCFiles vs Parquet?
 
  On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote:
 
  Didn't see anyone ask the question before, but I was wondering if anyone
  knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
  getting more and more popular in the Hive world.
 
  Thanks,
  James
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 



Re: Trouble running tests

2014-10-09 Thread Michael Armbrust
Also, in general for SQL-only changes it is sufficient to run sbt/sbt
catalyst/test sql/test hive/test.  The hive/test part takes the
longest, so I usually leave that out until just before submitting unless my
changes are hive-specific.

On Thu, Oct 9, 2014 at 11:40 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 _RUN_SQL_TESTS needs to be true as well. Those two _... variables get set
 correctly when tests are run on Jenkins. They’re not meant to be
 manipulated directly by testers.

 Did you want to run SQL tests only locally? You can try faking being
 Jenkins by setting AMPLAB_JENKINS=true before calling run-tests. That
 should be simpler than futzing with the _... variables.

 Nick

 On Thu, Oct 9, 2014 at 10:10 AM, Yana yana.kadiy...@gmail.com wrote:

  Hi, apologies if I missed a FAQ somewhere.
 
  I am trying to submit a bug fix for the very first time. Reading
  instructions, I forked the git repo (at
  c9ae79fba25cd49ca70ca398bc75434202d26a97) and am trying to run tests.
 
  I run this: ./dev/run-tests  _SQL_TESTS_ONLY=true
 
  and after a while get the following error:
 
  [info] ScalaTest
  [info] Run completed in 3 minutes, 37 seconds.
  [info] Total number of tests run: 224
  [info] Suites: completed 19, aborted 0
  [info] Tests: succeeded 224, failed 0, canceled 0, ignored 5, pending 0
  [info] All tests passed.
  [info] Passed: Total 224, Failed 0, Errors 0, Passed 224, Ignored 5
  [success] Total time: 301 s, completed Oct 9, 2014 9:31:23 AM
  [error] Expected ID character
  [error] Not a valid command: hive-thriftserver
  [error] Expected project ID
  [error] Expected configuration
  [error] Expected ':' (if selecting a configuration)
  [error] Expected key
  [error] Not a valid key: hive-thriftserver
  [error] hive-thriftserver/test
  [error]  ^
 
 
  (I am running this without my changes)
 
  I have 2 questions:
  1. How to fix this
  2. Is there a best practice on what to fork so you start off with a good
  state? I'm wondering if I should sync the latest changes or go back to a
  label?
 
  thanks in advance
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Trouble-running-tests-tp8717.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Thanks for the feedback. For 1, there is an open patch: 
https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually use 
MEMORY_AND_DISK storage, so they will spill to disk if you have low memory, but 
they're faster to access otherwise.
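
As a side note, the HttpBroadcast fallback mentioned further down can be
selected per application through SparkConf; a minimal sketch (assuming the
1.1 property name spark.broadcast.factory and the built-in factory classes):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("http-broadcast-fallback")
    // 1.1 defaults to org.apache.spark.broadcast.TorrentBroadcastFactory.
    .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
  val sc = new SparkContext(conf)

  // Broadcast a largish lookup table (contents are made up for illustration).
  val lookup = sc.broadcast((1 to 1000000).map(i => i -> i.toString).toMap)
  println(lookup.value.size)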

Matei

On Oct 9, 2014, at 12:11 PM, Guillaume Pitel guillaume.pi...@exensa.com wrote:

 Hi,
 
 Thanks to your answer, we've found the problem. It was on reverse IP 
 resolution on the drivers we used (wrong configuration of the local bind9). 
 Apparently, not being able to reverse-resolve the IP address of the nodes was 
 the culprit of the 10s delay.
 
 We've hit two other secondary problems with TorrentBroadcast though, in case 
 you're interested  :
 
 1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a 
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit, which is 
 not the case with HttpBroadcast (I guess HttpBroadcast splits the serialized 
 variable in small chunks)
 2 - Memory use of Torrent seems to be higher than Http (i.e. switching from 
 Http to Torrent triggers several OOM).
 
 Additionally, a question : while HttpBroadcast stores the broadcast pieces on 
 disk (in spark.local.dir/spark-... ), TorrentBroadcast seems not to use disk 
 backend storage. Does it mean that HttpBroadcast can handle bigger broadcast 
 out of memory ? If so, it's too bad that this design choice wasn't used for 
 Torrent.
 
 That being said, hats off to the people in charge of the broadcast unloading 
 wrt the lineage, this stuff works great !
 
 Guillaume
 
 
 Maybe there is a firewall issue that makes it slow for your nodes to connect 
 through the IP addresses they're configured with. I see there's this 10 
 second pause between Updated info of block broadcast_84_piece1 and 
 ensureFreeSpace(4194304) called (where it actually receives the block). 
 HTTP broadcast used only HTTP fetches from the executors to the driver, but 
 TorrentBroadcast has connections between the executors themselves and 
 between executors and the driver over a different port. Where are you 
 running your driver app and nodes?
 
 Matei
 
 On Oct 7, 2014, at 11:42 AM, Davies Liu dav...@databricks.com wrote:
 
  Could you create a JIRA for it? Maybe it's a regression after
  https://issues.apache.org/jira/browse/SPARK-3119.
  
  We would appreciate it if you could tell us how to reproduce it.
 
 On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
 guillaume.pi...@exensa.com wrote:
 Hi,
 
 I've had no answer to this on u...@spark.apache.org, so I post it on dev
 before filing a JIRA (in case the problem or solution is already 
 identified)
 
 We've had some performance issues since switching to 1.1.0, and we finally
 found the origin : TorrentBroadcast seems to be very slow in our setting
 (and it became default with 1.1.0)
 
 The logs of a 4MB variable with TorrentBroadcast : (15s)
 
 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 
 stored
 as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece1
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) called
 with curMem=1401611984, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 
 stored
 as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece0
 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
 variable 84 took 15.202260006 s
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) called
 with curMem=1405806288, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
 values in memory (estimated size 4.2 MB, free 7.2 GB)
 
  (notice that a 10s lag happens after the Updated info of block
  broadcast_... and before the MemoryStore log)
 
 And with HttpBroadcast (0.3s):
 
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
 variable 147
 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) called
 with curMem=1373493232, maxMem=9168696115
 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
 values in memory (estimated size 4.2 MB, free 7.3 GB)
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
 147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
 block broadcast_147 locally
 
 Since Torrent is supposed to perform much better than Http, we suspect a
 configuration error from our side, but are unable to pin it down. Does
 someone have any idea of the origin of the problem ?
 
 For now we're sticking with the HttpBroadcast workaround.
 
 Guillaume
 --
 Guillaume PITEL, Président
 +33(0)626 222 431
 
 eXenSa S.A.S.
 41, rue Périer - 92120 Montrouge - FRANCE
 Tel +33(0)184 163 677 / Fax +33(0)972 283 705
 -
 To unsubscribe, 

Re: TorrentBroadcast slow performance

2014-10-09 Thread Matei Zaharia
Oops, I forgot to add: for 2, maybe we can add a flag to use DISK_ONLY for
TorrentBroadcast, or use it automatically if the broadcasts are bigger than some size.

Matei

On Oct 9, 2014, at 3:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Thanks for the feedback. For 1, there is an open patch: 
 https://github.com/apache/spark/pull/2659. For 2, broadcast blocks actually 
 use MEMORY_AND_DISK storage, so they will spill to disk if you have low 
 memory, but they're faster to access otherwise.
 
 Matei
 
 On Oct 9, 2014, at 12:11 PM, Guillaume Pitel guillaume.pi...@exensa.com 
 wrote:
 
 Hi,
 
 Thanks to your answer, we've found the problem. It was on reverse IP 
 resolution on the drivers we used (wrong configuration of the local bind9). 
 Apparently, not being able to reverse-resolve the IP address of the nodes 
 was the culprit of the 10s delay.
 
 We've hit two other secondary problems with TorrentBroadcast though, in case 
 you're interested  :
 
 1 - Broadcasting a variable of about 2GB (1.8GB exactly) triggers a 
 java.lang.OutOfMemoryError: Requested array size exceeds VM limit, which 
 is not the case with HttpBroadcast (I guess HttpBroadcast splits the 
 serialized variable in small chunks)
 2 - Memory use of Torrent seems to be higher than Http (i.e. switching from 
 Http to Torrent triggers several OOM).
 
 Additionally, a question : while HttpBroadcast stores the broadcast pieces 
 on disk (in spark.local.dir/spark-... ), TorrentBroadcast seems not to use 
 disk backend storage. Does it mean that HttpBroadcast can handle bigger 
 broadcast out of memory ? If so, it's too bad that this design choice wasn't 
 used for Torrent.
 
 That being said, hats off to the people in charge of the broadcast unloading 
 wrt the lineage, this stuff works great !
 
 Guillaume
 
 
 Maybe there is a firewall issue that makes it slow for your nodes to 
 connect through the IP addresses they're configured with. I see there's 
 this 10 second pause between Updated info of block broadcast_84_piece1 
 and ensureFreeSpace(4194304) called (where it actually receives the 
 block). HTTP broadcast used only HTTP fetches from the executors to the 
 driver, but TorrentBroadcast has connections between the executors 
 themselves and between executors and the driver over a different port. 
 Where are you running your driver app and nodes?
 
 Matei
 
 On Oct 7, 2014, at 11:42 AM, Davies Liu dav...@databricks.com wrote:
 
  Could you create a JIRA for it? Maybe it's a regression after
  https://issues.apache.org/jira/browse/SPARK-3119.
  
  We would appreciate it if you could tell us how to reproduce it.
 
 On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
 guillaume.pi...@exensa.com wrote:
 Hi,
 
 I've had no answer to this on u...@spark.apache.org, so I post it on dev
 before filing a JIRA (in case the problem or solution is already 
 identified)
 
 We've had some performance issues since switching to 1.1.0, and we finally
 found the origin : TorrentBroadcast seems to be very slow in our setting
 (and it became default with 1.1.0)
 
 The logs of a 4MB variable with TorrentBroadcast : (15s)
 
 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84_piece1 
 stored
 as bytes in memory (estimated size 171.6 KB, free 7.2 GB)
 14/10/01 15:47:13 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece1
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4194304) 
 called
 with curMem=1401611984, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84_piece0 
 stored
 as bytes in memory (estimated size 4.0 MB, free 7.2 GB)
 14/10/01 15:47:23 INFO storage.BlockManagerMaster: Updated info of block
 broadcast_84_piece0
 14/10/01 15:47:23 INFO broadcast.TorrentBroadcast: Reading broadcast
 variable 84 took 15.202260006 s
 14/10/01 15:47:23 INFO storage.MemoryStore: ensureFreeSpace(4371392) 
 called
 with curMem=1405806288, maxMem=9168696115
 14/10/01 15:47:23 INFO storage.MemoryStore: Block broadcast_84 stored as
 values in memory (estimated size 4.2 MB, free 7.2 GB)
 
  (notice that a 10s lag happens after the Updated info of block
  broadcast_... and before the MemoryStore log)
 
 And with HttpBroadcast (0.3s):
 
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Started reading broadcast
 variable 147
 14/10/01 16:05:58 INFO storage.MemoryStore: ensureFreeSpace(4369376) 
 called
 with curMem=1373493232, maxMem=9168696115
 14/10/01 16:05:58 INFO storage.MemoryStore: Block broadcast_147 stored as
 values in memory (estimated size 4.2 MB, free 7.3 GB)
 14/10/01 16:05:58 INFO broadcast.HttpBroadcast: Reading broadcast variable
 147 took 0.320907112 s 14/10/01 16:05:58 INFO storage.BlockManager: Found
 block broadcast_147 locally
 
 Since Torrent is supposed to perform much better than Http, we suspect a
 configuration error from our side, but are unable to pin it down. Does
 someone have any idea of the origin of the problem ?
 
 For now we're sticking with the HttpBroadcast workaround.
 
 Guillaume
 --
 

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-09 Thread James Yu
Sounds great, thanks!



On Thu, Oct 9, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com
wrote:

 Yes, the foreign sources work is only about exposing a stable set of APIs
 for external libraries to link against (to avoid the spark assembly
 becoming a dependency mess).  The code path these APIs use will be the same
 as that for datasources included in the core spark sql library.

 Michael

 On Thu, Oct 9, 2014 at 2:18 PM, James Yu jym2...@gmail.com wrote:

  For performance, will foreign data formats be supported the same as native ones?

 Thanks,
 James


 On Wed, Oct 8, 2014 at 11:03 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  The foreign data source API PR also matters here
  https://www.github.com/apache/spark/pull/2475
 
  Foreign data source like ORC can be added more easily and systematically
  after this PR is merged.
 
  On 10/9/14 8:22 AM, James Yu wrote:
 
  Thanks Mark! I will keep an eye on it.

  @Evan, I saw people use both formats, so I really want to have Spark
  support ORCFile.
 
 
  On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra m...@clearstorydata.com
 
  wrote:
 
   https://github.com/apache/spark/pull/2576
 
 
 
  On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan velvia.git...@gmail.com
  wrote:
 
   James,
 
  Michael at the meetup last night said there was some development
  activity around ORCFiles.
 
  I'm curious though, what are the pros and cons of ORCFiles vs
 Parquet?
 
  On Wed, Oct 8, 2014 at 10:03 AM, James Yu jym2...@gmail.com wrote:
 
  Didn't see anyone ask the question before, but I was wondering if anyone
  knows if Spark/SparkSQL will support the ORCFile format soon? ORCFile is
  getting more and more popular in the Hive world.
 
  Thanks,
  James
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
 
 





spark-prs and mesos/spark-ec2

2014-10-09 Thread Nicholas Chammas
Does it make sense to point the Spark PR review board to read from
mesos/spark-ec2 as well? PRs submitted against that repo may reference
Spark JIRAs and need review just like any other Spark PR.

Nick


[Spark SQL] Strange NPE in Spark SQL with Hive

2014-10-09 Thread Trident
Hi Community,

  I use Spark 1.0.2 and am using Spark SQL to run Hive SQL queries.

  When I run the following code in Spark Shell:

val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
count.collect()
  Correct and no error!

  When I run the following code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SHOW TABLES").collect().foreach(println)

  Correct and no error!

  But when I run:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("SELECT COUNT(*) from uservisits").collect().foreach(println)

  It comes with some error messages.


  What I found was the following error:  
14/10/09 19:47:34 ERROR Executor: Exception in task ID 4
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 5
14/10/09 19:47:34 INFO Executor: Running task ID 5
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Sending request for 2 blocks (2.5 KB) from node19:50868
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_0_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 5, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Adding BlockMessage [type = 1, id = shuffle_0_1_1, level = null, data = null]
14/10/09 19:47:34 DEBUG BlockMessageArray: Added BufferMessage(id = 6, size = 34)
14/10/09 19:47:34 DEBUG BlockMessageArray: Buffer list:
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]
14/10/09 19:47:34 DEBUG BlockMessageArray: java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote fetches in 2 ms
14/10/09 19:47:34 DEBUG BlockFetcherIterator$BasicBlockFetcherIterator: Got local blocks in 0 ms ms
14/10/09 19:47:34 ERROR Executor: Exception in task ID 5
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
  at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
14/10/09 19:47:34 INFO CoarseGrainedExecutorBackend: Got assigned task 6
14/10/09 19:47:34 INFO Executor: Running task ID 6
14/10/09 19:47:34 DEBUG BlockManager: Getting local block broadcast_1
14/10/09 19:47:34 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(true, true, false, true, 1)
14/10/09 19:47:34 DEBUG BlockManager: Getting block broadcast_1 from memory
14/10/09 19:47:34 INFO BlockManager: Found block broadcast_1 locally
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/10/09 19:47:34 INFO BlockFetcherIterator$BasicBlockFetcherIterator:

[Spark SQL Continue] Sorry, it is not only limited in SQL, may due to network

2014-10-09 Thread Trident
Dear Community,

   Please ignore my last post about Spark SQL.

   When I run:
val file = sc.textFile("./README.md")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
count.collect()
it happens too.

Is there any possible reason for that? We did make some adjustments to the
network last night.


 Chen Weikeng
14/10/09 20:45:23 ERROR Executor: Exception in task ID 1
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
14/10/09 20:45:23 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/10/09 20:45:23 INFO Executor: Running task ID 2
14/10/09 20:45:23 DEBUG BlockManager: Getting local block broadcast_0
14/10/09 20:45:23 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(true, true, false, true, 1)
14/10/09 20:45:23 DEBUG BlockManager: Getting block broadcast_0 from memory
14/10/09 20:45:23 INFO BlockManager: Found block broadcast_0 locally
14/10/09 20:45:23 DEBUG Executor: Task 2's epoch is 0
14/10/09 20:45:23 INFO HadoopRDD: Input split: file:/public/rdma14/app/spark-rdma/examples/src/main/resources/people.txt:16+16
14/10/09 20:45:23 ERROR Executor: Exception in task ID 2
java.lang.NullPointerException
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
  at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:116)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Re: Spark on Mesos 0.20

2014-10-09 Thread Fairiz Azizi
Hello,

Sorry for the late reply.

When I tried the LogQuery example this time, things now seem to be fine!

...

14/10/10 04:01:21 INFO scheduler.DAGScheduler: Stage 0 (collect at
LogQuery.scala:80) finished in 0.429 s

14/10/10 04:01:21 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
whose tasks have all completed, from pool defa

14/10/10 04:01:21 INFO spark.SparkContext: Job finished: collect at
LogQuery.scala:80, took 12.802743914 s

(10.10.10.10,FRED,GET http://images.com/2013/Generic.jpg HTTP/1.1)
bytes=621   n=2


Not sure if this is the correct response for that example.

Our mesos/spark builds have been updated since I last wrote.

Possibly, the JDK version was updated to 1.7.0_67

If you are using an older JDK, maybe try updating that?


- Fi



Fairiz Fi Azizi

On Wed, Oct 8, 2014 at 7:54 AM, RJ Nowling rnowl...@gmail.com wrote:

 Yep!  That's the example I was talking about.

 Is an error message printed when it hangs? I get:

 14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block 
 manager registrations on 20140930-131734-1723727882-5050-1895-1



 On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi code...@gmail.com wrote:

 Sure, could you point me to the example?

 The only thing I could find was

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/LogQuery.scala

 So do you mean running it like:
    MASTER=mesos://xxx:5050 ./run-example LogQuery

 I tried that, and I can see the job run and the tasks complete on the
 slave nodes, but the client process seems to hang forever; it's probably a
 different problem. BTW, only a dozen or so tasks kick off.

 I actually haven't done much with Scala and Spark (it's been all python).

 Fi



 Fairiz Fi Azizi

 On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling rnowl...@gmail.com wrote:

 I was able to reproduce it on a small 4 node cluster (1 mesos master and
 3 mesos slaves) with relatively low-end specs.  As I said, I just ran the
 log query examples with the fine-grained mesos mode.

 Spark 1.1.0 and mesos 0.20.1.

 Fairiz, could you try running the logquery example included with Spark
 and see what you get?

 Thanks!

 On Mon, Oct 6, 2014 at 8:07 PM, Fairiz Azizi code...@gmail.com wrote:

 That's what's great about Spark; the community is so active! :)

 I compiled Mesos 0.20.1 from the source tarball.

 Using the Mapr3 Spark 1.1.0 distribution from the Spark downloads page
  (spark-1.1.0-bin-mapr3.tgz).

 I see no problems for the workloads we are trying.

 However, the cluster is small (less than 100 cores across 3 nodes).

 The workload reads in just a few gigabytes from HDFS, via an IPython
 notebook Spark shell.

 thanks,
 Fi



 Fairiz Fi Azizi

 On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen tnac...@gmail.com wrote:

 Ok I created SPARK-3817 to track this, will try to repro it as well.

 Tim

 On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling rnowl...@gmail.com wrote:
  I've recently run into this issue as well. I get it from running
 Spark
  examples such as log query.  Maybe that'll help reproduce the issue.
 
 
  On Monday, October 6, 2014, Gurvinder Singh 
 gurvinder.si...@uninett.no
  wrote:
 
  The issue does not occur if the task at hand has a small number of map
  tasks. I have a task which has 978 map tasks and I see this error as
 
  14/10/06 09:34:40 ERROR BlockManagerMasterActor: Got two different
 block
  manager registrations on 20140711-081617-711206558-5050-2543-5
 
  Here is the log from the mesos-slave where this container was
 running.
 
  http://pastebin.com/Q1Cuzm6Q
 
  If you look at the code where this error is produced by Spark, you will
  see that it simply exits, with a comment saying this should never
  happen, let's just quit :-)
 
  - Gurvinder
  On 10/06/2014 09:30 AM, Timothy Chen wrote:
   (Hit enter too soon...)
  
   What is your setup and steps to repro this?
  
   Tim
  
   On Mon, Oct 6, 2014 at 12:30 AM, Timothy Chen tnac...@gmail.com
 wrote:
   Hi Gurvinder,
  
   I tried fine grain mode before and didn't get into that problem.
  
  
   On Sun, Oct 5, 2014 at 11:44 PM, Gurvinder Singh
   gurvinder.si...@uninett.no wrote:
   On 10/06/2014 08:19 AM, Fairiz Azizi wrote:
   The Spark online docs indicate that Spark is compatible with
 Mesos
   0.18.1
  
   I've gotten it to work just fine on 0.18.1 and 0.18.2
  
   Has anyone tried Spark on a newer version of Mesos, i.e. Mesos
   v0.20.0?
  
   -Fi
  
    Yeah, we are using Spark 1.1.0 with Mesos 0.20.1. It runs fine in
    coarse-grained mode; in fine-grained mode there is an issue with
    conflicting block manager names. I have been waiting for it to be fixed,
    but it is still there.
  
   -Gurvinder
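
    For reference, coarse-grained mode is selected per application; a minimal
    sketch, assuming the spark.mesos.coarse property as documented for 1.1:

      val conf = new org.apache.spark.SparkConf()
        .setMaster("mesos://host:5050")       // placeholder master URL
        .setAppName("coarse-mode-example")
        .set("spark.mesos.coarse", "true")    // default is fine-grained mode
      val sc = new org.apache.spark.SparkContext(conf)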
  
  
 -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
 
 
 
 -
  To unsubscribe, e-mail: