Re: For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Andrew Ash
I don't know if anyone has done benchmarking of different JVMs for Spark
specifically, but the widely-held belief seems to be that the Oracle JDK is
slightly more performant.

Elasticsearch makes heavy use of Lucene, which is a particularly intense
workout for a JVM, and they recommend using the Oracle JDK where possible.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-service.html#_installing_the_oracle_jdk


On Sun, May 18, 2014 at 11:50 PM, Hao Wang  wrote:

> Hi,
>
> Oracle JDK and OpenJDK, which one is better or preferred for Spark?
>
>
> Regards,
> Wang Hao(王灏)
>
> CloudTeam | School of Software Engineering
> Shanghai Jiao Tong University
> Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
> Email:wh.s...@gmail.com
>


Re: For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Gordon Wang
I would say that the Oracle JDK may be the better choice. Many Hadoop
distribution vendors use the Oracle JDK instead of OpenJDK for enterprise
deployments.


On Mon, May 19, 2014 at 2:50 PM, Hao Wang  wrote:

> Hi,
>
> Oracle JDK and OpenJDK, which one is better or preferred for Spark?
>
>
> Regards,
> Wang Hao(王灏)
>
> CloudTeam | School of Software Engineering
> Shanghai Jiao Tong University
> Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
> Email:wh.s...@gmail.com
>



-- 
Regards
Gordon Wang


For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-18 Thread Hao Wang
Hi,

Oracle JDK and OpenJDK, which one is better or preferred for Spark?


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Andrew Ash
If the codebase for Spark's broadcast is pretty self-contained, you could
consider creating a small bootstrap sent out via the doubling rsync
strategy that Mosharaf outlined above (called "Tree D=2" in the paper) that
then pulled the larger payload via Spark's own broadcast.

Mosharaf, do you have a sense of whether the gains from using Cornet vs
Tree D=2 with rsync outweigh the overhead of using a 2-phase broadcast
mechanism?

Andrew


On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson  wrote:

> One issue with using Spark itself is that this rsync is required to get
> Spark to work...
>
> Also note that a similar strategy is used for *updating* the spark
> cluster on ec2, where the "diff" aspect is much more important, as you
> might only make a small change on the driver node (recompile or
> reconfigure) and can get a fast sync.
>
>
> On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury <
> mosharafka...@gmail.com> wrote:
>
>> What twitter calls murder, unless it has changed since then, is just a
>> BitTornado wrapper. In 2011, We did some comparison on the performance of
>> murder and the TorrentBroadcast we have right now for Spark's own broadcast
>> (Section 7.1 in
>> http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
>> Spark's implementation was 4.5X faster than murder.
>>
>> The only issue with using TorrentBroadcast to deploy code/VM is writing a
>> wrapper around it to read from disk, but it shouldn't be too complicated.
>> If someone picks it up, I can give some pointers on how to proceed (I've
>> thought about doing it myself forever, but never ended up actually taking
>> the time; right now I don't have enough free cycles either)
>>
>> Otherwise, murder/BitTornado would be better than the current strategy we
>> have.
>>
>> A third option would be to use rsync; but instead of rsync-ing to every
>> slave from the master, one can simply rsync from the master first to one
>> slave; then use the two sources (master and the first slave) to rsync to
>> two more; then four and so on. Might be a simpler solution without many
>> changes.
>>
>> --
>> Mosharaf Chowdhury
>> http://www.mosharaf.com/
>>
>>
>> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash wrote:
>>
>>> My first thought would be to use libtorrent for this setup, and it turns
>>> out that both Twitter and Facebook do code deploys with a bittorrent setup.
>>>  Twitter even released their code as open source:
>>>
>>>
>>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>>>
>>>
>>> http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/
>>>
>>>
>>> On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler wrote:
>>>
 I am not an expert in this space either. I thought the initial rsync
 during launch is really just a straight copy that did not need the tree
 diff. So it seemed like having the slaves do the copying among it each
 other would be better than having the master copy to everyone directly.
 That made me think of bittorrent, though there may well be other systems
 that do this.
 From the launches I did today it seems that it is taking around 1
 minute per slave to launch a cluster, which can be a problem for clusters
 with 10s or 100s of slaves, particularly since on ec2  that time has to be
 paid for.


 On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson wrote:

> Out of curiosity, do you have a library in mind that would make it
> easy to setup a bit torrent network and distribute files in an rsync 
> (i.e.,
> apply a diff to a tree, ideally) fashion? I'm not familiar with this 
> space,
> but we do want to minimize the complexity of our standard ec2 launch
> scripts to reduce the chance of something breaking.
>
>
> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler wrote:
>
>> I am launching a rather large cluster on ec2.
>> It seems like the launch is taking forever on
>> 
>> Setting up spark
>> RSYNC'ing /root/spark to slaves...
>> ...
>>
>> It seems that bittorrent might be a faster way to replicate
>> the sizeable spark directory to the slaves
>> particularly if there is a lot of not very powerful slaves.
>>
>> Just a thought ...
>>
>> cheers
>> Daniel
>>
>>
>

>>>
>>
>


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
One issue with using Spark itself is that this rsync is required to get
Spark to work...

Also note that a similar strategy is used for *updating* the spark cluster
on ec2, where the "diff" aspect is much more important, as you might only
make a small change on the driver node (recompile or reconfigure) and can
get a fast sync.


On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury <
mosharafka...@gmail.com> wrote:

> What twitter calls murder, unless it has changed since then, is just a
> BitTornado wrapper. In 2011, We did some comparison on the performance of
> murder and the TorrentBroadcast we have right now for Spark's own broadcast
> (Section 7.1 in
> http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
> Spark's implementation was 4.5X faster than murder.
>
> The only issue with using TorrentBroadcast to deploy code/VM is writing a
> wrapper around it to read from disk, but it shouldn't be too complicated.
> If someone picks it up, I can give some pointers on how to proceed (I've
> thought about doing it myself forever, but never ended up actually taking
> the time; right now I don't have enough free cycles either)
>
> Otherwise, murder/BitTornado would be better than the current strategy we
> have.
>
> A third option would be to use rsync; but instead of rsync-ing to every
> slave from the master, one can simply rsync from the master first to one
> slave; then use the two sources (master and the first slave) to rsync to
> two more; then four and so on. Might be a simpler solution without many
> changes.
>
> --
> Mosharaf Chowdhury
> http://www.mosharaf.com/
>
>
> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash  wrote:
>
>> My first thought would be to use libtorrent for this setup, and it turns
>> out that both Twitter and Facebook do code deploys with a bittorrent setup.
>>  Twitter even released their code as open source:
>>
>>
>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>>
>>
>> http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/
>>
>>
>> On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler wrote:
>>
>>> I am not an expert in this space either. I thought the initial rsync
>>> during launch is really just a straight copy that did not need the tree
>>> diff. So it seemed like having the slaves do the copying among it each
>>> other would be better than having the master copy to everyone directly.
>>> That made me think of bittorrent, though there may well be other systems
>>> that do this.
>>> From the launches I did today it seems that it is taking around 1 minute
>>> per slave to launch a cluster, which can be a problem for clusters with 10s
>>> or 100s of slaves, particularly since on ec2  that time has to be paid for.
>>>
>>>
>>> On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson wrote:
>>>
 Out of curiosity, do you have a library in mind that would make it easy
 to setup a bit torrent network and distribute files in an rsync (i.e.,
 apply a diff to a tree, ideally) fashion? I'm not familiar with this space,
 but we do want to minimize the complexity of our standard ec2 launch
 scripts to reduce the chance of something breaking.


 On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler wrote:

> I am launching a rather large cluster on ec2.
> It seems like the launch is taking forever on
> 
> Setting up spark
> RSYNC'ing /root/spark to slaves...
> ...
>
> It seems that bittorrent might be a faster way to replicate
> the sizeable spark directory to the slaves
> particularly if there is a lot of not very powerful slaves.
>
> Just a thought ...
>
> cheers
> Daniel
>
>

>>>
>>
>


Re: problem with hdfs access in spark job

2014-05-18 Thread Marcin Cylke
On Thu, 15 May 2014 09:44:35 -0700
Marcelo Vanzin  wrote:

> These are actually not worrisome; that's just the HDFS client doing
> its own thing to support HA. It probably picked the "wrong" NN to try
> first, and got the "NN in standby" exception, which it logs. Then it
> tries the other NN and things just work as expected. Business as
> usual.
> 
> Not sure about the other exceptions you mention. I've seen the second
> one before, but it didn't seem to affect my jobs - maybe some race
> during cleanup.
> 

Ok, great to hear that these errors are not that serious.

Thanks
Marcin


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Mosharaf Chowdhury
What Twitter calls murder, unless it has changed since then, is just a
BitTornado wrapper. In 2011, we did some comparison of the performance of
murder and the TorrentBroadcast we have right now for Spark's own broadcast
(Section 7.1 in
http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
Spark's implementation was 4.5x faster than murder.

The only issue with using TorrentBroadcast to deploy code/VM is writing a
wrapper around it to read from disk, but it shouldn't be too complicated.
If someone picks it up, I can give some pointers on how to proceed (I've
thought about doing it myself forever, but never ended up actually taking
the time; right now I don't have enough free cycles either)

Otherwise, murder/BitTornado would be better than the current strategy we
have.

A third option would be to use rsync; but instead of rsync-ing to every
slave from the master, one can simply rsync from the master first to one
slave; then use the two sources (master and the first slave) to rsync to
two more; then four and so on. Might be a simpler solution without many
changes.
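
A minimal sketch of that doubling idea (hostnames and paths here are
hypothetical, and it assumes the machines can reach each other over
passwordless SSH):

import scala.sys.process._

// Every host that already has the directory becomes a source for the next
// round, so the number of synced hosts roughly doubles on each pass.
def doublingSync(master: String, slaves: Seq[String], dir: String): Unit = {
  var sources = List(master)
  var remaining = slaves.toList
  while (remaining.nonEmpty) {
    val batch = remaining.take(sources.size)
    remaining = remaining.drop(sources.size)
    // Each current source pushes to one new target; the pushes run in parallel.
    sources.zip(batch).par.foreach { case (src, dst) =>
      Seq("ssh", src, s"rsync -az $dir/ $dst:$dir/").!
    }
    sources = sources ++ batch
  }
}

doublingSync("master", Seq("slave1", "slave2", "slave3", "slave4"), "/root/spark")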

--
Mosharaf Chowdhury
http://www.mosharaf.com/


On Sun, May 18, 2014 at 11:07 PM, Andrew Ash  wrote:

> My first thought would be to use libtorrent for this setup, and it turns
> out that both Twitter and Facebook do code deploys with a bittorrent setup.
>  Twitter even released their code as open source:
>
>
> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>
>
> http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/
>
>
> On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler  wrote:
>
>> I am not an expert in this space either. I thought the initial rsync
>> during launch is really just a straight copy that did not need the tree
>> diff. So it seemed like having the slaves do the copying among it each
>> other would be better than having the master copy to everyone directly.
>> That made me think of bittorrent, though there may well be other systems
>> that do this.
>> From the launches I did today it seems that it is taking around 1 minute
>> per slave to launch a cluster, which can be a problem for clusters with 10s
>> or 100s of slaves, particularly since on ec2  that time has to be paid for.
>>
>>
>> On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson wrote:
>>
>>> Out of curiosity, do you have a library in mind that would make it easy
>>> to setup a bit torrent network and distribute files in an rsync (i.e.,
>>> apply a diff to a tree, ideally) fashion? I'm not familiar with this space,
>>> but we do want to minimize the complexity of our standard ec2 launch
>>> scripts to reduce the chance of something breaking.
>>>
>>>
>>> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler wrote:
>>>
 I am launching a rather large cluster on ec2.
 It seems like the launch is taking forever on
 
 Setting up spark
 RSYNC'ing /root/spark to slaves...
 ...

 It seems that bittorrent might be a faster way to replicate
 the sizeable spark directory to the slaves
 particularly if there is a lot of not very powerful slaves.

 Just a thought ...

 cheers
 Daniel


>>>
>>
>


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Andrew Ash
My first thought would be to use libtorrent for this setup, and it turns
out that both Twitter and Facebook do code deploys with a bittorrent setup.
 Twitter even released their code as open source:

https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent

http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/


On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler  wrote:

> I am not an expert in this space either. I thought the initial rsync
> during launch is really just a straight copy that did not need the tree
> diff. So it seemed like having the slaves do the copying among it each
> other would be better than having the master copy to everyone directly.
> That made me think of bittorrent, though there may well be other systems
> that do this.
> From the launches I did today it seems that it is taking around 1 minute
> per slave to launch a cluster, which can be a problem for clusters with 10s
> or 100s of slaves, particularly since on ec2  that time has to be paid for.
>
>
> On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson wrote:
>
>> Out of curiosity, do you have a library in mind that would make it easy
>> to setup a bit torrent network and distribute files in an rsync (i.e.,
>> apply a diff to a tree, ideally) fashion? I'm not familiar with this space,
>> but we do want to minimize the complexity of our standard ec2 launch
>> scripts to reduce the chance of something breaking.
>>
>>
>> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler  wrote:
>>
>>> I am launching a rather large cluster on ec2.
>>> It seems like the launch is taking forever on
>>> 
>>> Setting up spark
>>> RSYNC'ing /root/spark to slaves...
>>> ...
>>>
>>> It seems that bittorrent might be a faster way to replicate
>>> the sizeable spark directory to the slaves
>>> particularly if there is a lot of not very powerful slaves.
>>>
>>> Just a thought ...
>>>
>>> cheers
>>> Daniel
>>>
>>>
>>
>


Re: Using mongo with PySpark

2014-05-18 Thread Samarth Mailinglist
from pymongo import MongoClient

db = MongoClient()['spark_test_db']
collec = db['programs']

def mapper(val):
    asc = val.encode('ascii', 'ignore')
    json = convertToJSON(asc, indexMap)  # indexMap is defined elsewhere in the driver
    collec.insert(json)  # this is not working

def convertToJSON(string, indexMap):
    values = string.strip().split(",")
    json = {}
    for i in range(len(values)):
        json[indexMap[i]] = values[i]
    return json

jsons = data.map(mapper)



The last line does the mapping. I am very new to Spark; can you explain
what explicit serialization, etc., means in the context of Spark? The error I
am getting:
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/local/spark-0.9.1/python/pyspark/rdd.py", line 712, in saveAsTextFile
    keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
  File "/usr/local/spark-0.9.1/python/pyspark/rdd.py", line 1178, in _jrdd
    pickled_command = CloudPickleSerializer().dumps(command)
  File "/usr/local/spark-0.9.1/python/pyspark/serializers.py", line 275, in dumps
    def dumps(self, obj): return cloudpickle.dumps(obj, 2)
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 801, in dumps
    cp.dump(obj)
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 140, in dump
    return pickle.Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 259, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 316, in save_function_tuple
    save(closure)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 259, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 316, in save_function_tuple
    save(closure)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, modList)
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 314, in save_function_tuple
    save(f_globals)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/spark-0.9.1/python/pyspark/cloudpickle.py", line 181, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 306, in save
    rv = reduce(self.proto)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 1489, in __call__
    self.__name.split(".")[-1])
TypeError: 'Collection' object is not callable. If you meant to call the
'__getnewargs__' method on a 'Collection' object it is failing because no such
method exists.


On Sat, May 17, 2014 at 9:30 PM, Mayur Rustagi wrote:

> Ideally you have to pass the MongoClient object along with your data in
> the mapper (Python should try to serialize your MongoClient, but explicit
> is better).
> If the client is serializable then all should end well; if not, then you are
> better off using mapPartitions, initializing the driver in each partition, and
> loading the data of each partition. There is a similar discussion in the list
> in the past.
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Sat, May 17, 2014 at 8:58 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Where's your driver code (the code interacting with the RDDs)? Are you
>> getting serialization errors?
>>
>> On Saturday, May 17, 2014, Samarth Mailinglist wrote:
>>
>> Hi all,
>>>
>>> I am trying to store the results of a reduce into mongo.
>>> I want to share the variable "collection" in the mappers.
>>>
>>>
>>> Here'

Re: sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am not an expert in this space either. I thought the initial rsync during
launch is really just a straight copy that did not need the tree diff. So
it seemed like having the slaves do the copying among themselves would
be better than having the master copy to everyone directly. That made me
think of bittorrent, though there may well be other systems that do this.
From the launches I did today it seems that it is taking around 1 minute
per slave to launch a cluster, which can be a problem for clusters with 10s
or 100s of slaves, particularly since on ec2 that time has to be paid for.


On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson  wrote:

> Out of curiosity, do you have a library in mind that would make it easy to
> setup a bit torrent network and distribute files in an rsync (i.e., apply a
> diff to a tree, ideally) fashion? I'm not familiar with this space, but we
> do want to minimize the complexity of our standard ec2 launch scripts to
> reduce the chance of something breaking.
>
>
> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler  wrote:
>
>> I am launching a rather large cluster on ec2.
>> It seems like the launch is taking forever on
>> 
>> Setting up spark
>> RSYNC'ing /root/spark to slaves...
>> ...
>>
>> It seems that bittorrent might be a faster way to replicate
>> the sizeable spark directory to the slaves
>> particularly if there is a lot of not very powerful slaves.
>>
>> Just a thought ...
>>
>> cheers
>> Daniel
>>
>>
>


Re: sync master with slaves with bittorrent?

2014-05-18 Thread Aaron Davidson
Out of curiosity, do you have a library in mind that would make it easy to
setup a bit torrent network and distribute files in an rsync (i.e., apply a
diff to a tree, ideally) fashion? I'm not familiar with this space, but we
do want to minimize the complexity of our standard ec2 launch scripts to
reduce the chance of something breaking.


On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler  wrote:

> I am launching a rather large cluster on ec2.
> It seems like the launch is taking forever on
> 
> Setting up spark
> RSYNC'ing /root/spark to slaves...
> ...
>
> It seems that bittorrent might be a faster way to replicate
> the sizeable spark directory to the slaves
> particularly if there is a lot of not very powerful slaves.
>
> Just a thought ...
>
> cheers
> Daniel
>
>


sync master with slaves with bittorrent?

2014-05-18 Thread Daniel Mahler
I am launching a rather large cluster on ec2.
It seems like the launch is taking forever on

Setting up spark
RSYNC'ing /root/spark to slaves...
...

It seems that bittorrent might be a faster way to replicate
the sizeable spark directory to the slaves
particularly if there are a lot of not very powerful slaves.

Just a thought ...

cheers
Daniel


unsubscribe

2014-05-18 Thread Venkat Krishnamurthy




Re: unsubscribe

2014-05-18 Thread Madhu
The volume on the list has grown to the point that individual emails can
become excessive.
That might be the reason for the increase in recent unsubscribes.

You can subscribe to daily digests only using these addresses:

Similar addresses exist for the digest list:
   
   

The DEV list has an option to do that from the Web UI, but I don't see that
on the user list.



-
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/unsubscribe-tp5985p6004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
To correct what I said above:

ldconfig does work; it automatically makes a link:
libopenblas.so.0 -> libopenblas_nehalemp-r0.2.9.rc2.so

but what I need is libblas.so.3, so I tried several ways:
1. create a file called libblas.so.3, then run ldconfig
2. create a file called libblas.so.3.0, then run ldconfig

I hoped ldconfig would generate a link file called libblas.so.3, but it seems
libblas.so.3 is ignored by ldconfig.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p6002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Thank you Xiangrui, I also think it may be a problem with the linking.

I tried several ways:
1. export LD_LIBRARY_PATH=mypath
2. create the link file in /usr/lib
   lrwxrwxrwx 1 root root   34 May 19 00:38 libblas.so.3 ->
   libopenblas_nehalemp-r0.2.9.rc2.so
3. add mypath to /etc/ld.so.conf, then run ldconfig

1 and 2 do not work; as to 3, it seems that ldconfig doesn't work on Amazon
Linux. I checked it by using ldconfig -p, but could not find my .so file.


Xiangrui Meng wrote
> The classpath seems to be correct. Where did you link libopenblas*.so
> to? The safest approach is to rename it to /usr/lib/libblas.so.3 and
> /usr/lib/liblapack.so.3 . This is the way I made it work. -Xiangrui
> 
> On Sun, May 18, 2014 at 4:49 PM, wxhsdp <

> wxhsdp@

> > wrote:
>> ok
>>
>> Spark Executor Command: "java" "-cp"
>> ":/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar
>> ::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
>> "-Xms4096M" "-Xmx4096M"
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p6000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread Xiangrui Meng
The classpath seems to be correct. Where did you link libopenblas*.so
to? The safest approach is to rename it to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3 . This is the way I made it work. -Xiangrui

On Sun, May 18, 2014 at 4:49 PM, wxhsdp  wrote:
> ok
>
> Spark Executor Command: "java" "-cp"
> ":/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar
> ::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
> "-Xms4096M" "-Xmx4096M"
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Passing runtime config to workers?

2014-05-18 Thread DB Tsai
When you reference any variable from outside the executor's scope, Spark will
automatically serialize it in the driver and send it to the executors, which
implies those variables have to be serializable.

For the example you mention, Spark will serialize object F, and if it's
not serializable, it will raise an exception.
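
For instance, a minimal sketch of that pattern (the USE_ALGO variable and the
arithmetic are made up purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object ConfigExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("config-example"))
    // Read the setting once on the driver...
    val algo = sys.env.getOrElse("USE_ALGO", "A")
    // ...and reference it inside the closure: Spark serializes the captured
    // value together with the task and ships it to the executors.
    val result = sc.parallelize(1 to 10).map { x =>
      if (algo == "A") x * 2 else x * 3
    }
    println(result.collect().mkString(","))
    sc.stop()
  }
}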


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 18, 2014 at 12:58 PM, Robert James wrote:

> I see - I didn't realize that scope would work like that.  Are you
> saying that any variable that is in scope of the lambda passed to map
> will be automagically propagated to all workers? What if it's not
> explicitly referenced in the map, only used by it.  E.g.:
>
> def main:
>   settings.setSettings
>   rdd.map(x => F.f(x))
>
> object F {
>   def f(...)...
>   val settings:...
> }
>
> F.f accesses F.settings, like a Singleton.  The master sets F.settings
> before using F.f in a map.  Will all workers have the same F.settings
> as seen by F.f?
>
>
>
> On 5/16/14, DB Tsai  wrote:
> > Since the env variables in the driver will not be passed into workers,
> > the easiest way is to refer to the variables directly in workers from
> > the driver.
> >
> > For example,
> >
> > val variableYouWantToUse = System.getenv("something defined in env")
> >
> > rdd.map(
> > you can access `variableYouWantToUse` here
> > )
> >
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > ---
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Fri, May 16, 2014 at 1:59 PM, Robert James
> > wrote:
> >
> >> What is a good way to pass config variables to workers?
> >>
> >> I've tried setting them in environment variables via spark-env.sh, but,
> >> as
> >> far as I can tell, the environment variables set there don't appear in
> >> workers' environments.  If I want to be able to configure all workers,
> >> what's a good way to do it?  For example, I want to tell all workers:
> >> USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
> >>
> >
>


unsubscribe

2014-05-18 Thread Terje Berg-Hansen


Andre Bois-Crettez  wrote: 

>We never saw your exception when reading bzip2 files with spark.
>
>But when we wrongly compiled spark against an older version of hadoop (which
>was the default in spark), we ended up with sequential reading of the bzip2
>file, not taking advantage of block splits to work in parallel.
>Once we compiled spark with SPARK_HADOOP_VERSION=2.2.0, files were read
>in parallel, as expected with a recent hadoop.
>
>http://spark.apache.org/docs/0.9.1/#a-note-about-hadoop-versions
>
>Make sure Spark is compiled against Hadoop v2
>
>André
>
>On 2014-05-13 18:08, Xiangrui Meng wrote:
>> Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
>> the problem you described, but it does contain several fixes to bzip2
>> format. -Xiangrui
>>
>> On Wed, May 7, 2014 at 9:19 PM, Andrew Ash  wrote:
>>> Hi all,
>>>
>>> Is anyone reading and writing to .bz2 files stored in HDFS from Spark with
>>> success?
>>>
>>>
>>> I'm finding the following results on a recent commit (756c96 from 24hr ago)
>>> and CDH 4.4.0:
>>>
>>> Works: val r = sc.textFile("/user/aa/myfile.bz2").count
>>> Doesn't work: val r = sc.textFile("/user/aa/myfile.bz2").map((s:String) =>
>>> s+"| " ).count
>>>
>>> Specifically, I'm getting an exception coming out of the bzip2 libraries
>>> (see below stacktraces), which is unusual because I'm able to read from that
>>> file without an issue using the same libraries via Pig.  It was originally
>>> created from Pig as well.
>>>
>>> Digging a little deeper I found this line in the .bz2 decompressor's javadoc
>>> for CBZip2InputStream:
>>>
>>> "Instances of this class are not threadsafe." [source]
>>>
>>>
>>> My current working theory is that Spark has a much higher level of
>>> parallelism than Pig/Hadoop does and thus I get these wild IndexOutOfBounds
>>> exceptions much more frequently (as in can't finish a run over a little 2M
>>> row file) vs hardly at all in other libraries.
>>>
>>> The only other reference I could find to the issue was in presto-users, but
>>> the recommendation to leave .bz2 for .lzo doesn't help if I actually do want
>>> the higher compression levels of .bz2.
>>>
>>>
>>> Would love to hear if I have some kind of configuration issue or if there's
>>> a bug in .bz2 that's fixed in later versions of CDH, or generally any other
>>> thoughts on the issue.
>>>
>>>
>>> Thanks!
>>> Andrew
>>>
>>>
>>>
>>> Below are examples of some exceptions I'm getting:
>>>
>>> 14/05/07 15:09:49 WARN scheduler.TaskSetManager: Loss was due to
>>> java.lang.ArrayIndexOutOfBoundsException
>>> java.lang.ArrayIndexOutOfBoundsException: 65535
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.hbCreateDecodeTables(CBZip2InputStream.java:663)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.createHuffmanDecodingTables(CBZip2InputStream.java:790)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:762)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:798)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
>>>  at
>>> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
>>>  at java.io.InputStream.read(InputStream.java:101)
>>>  at
>>> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
>>>  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
>>>  at
>>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
>>>  at
>>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:43)
>>>
>>>
>>>
>>>
>>> java.lang.ArrayIndexOutOfBoundsException: 90
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:900)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:502)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
>>>  at
>>> org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:397)
>>>  at
>>> org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:426)
>>>  at java.io.InputStream.read(InputStream.java:101)
>>>  at
>>> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
>>>  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
>>>  at
>>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:203)
>>>  at
>>> org.apa

making spark/conf/spark-defaults.conf changes take effect

2014-05-18 Thread Daniel Mahler
I am running an AWS EC2 cluster that I launched using the spark-ec2 script
that comes with Spark, and I use the "-v master" option to run the head
version.

If I then log into master and make changes to spark/conf/spark-defaults.conf
How do I make the changes take effect across the cluster?

Is just restarting spark-shell enough? (It does not seem to be)
Does  "~/spark/sbin/stop-all.sh ; sleep 5; ~/spark/sbin/start-all.sh" do it?
Do I need to copy the new spark-defaults.conf to all the slaves?
Or is there some command to sync everything?

thanks
Daniel


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
ok

Spark Executor Command: "java" "-cp"
":/root/ephemeral-hdfs/conf:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scalanlp/breeze_2.10/jars/breeze_2.10-0.7.jar:/root/.ivy2/cache/org.scalanlp/breeze-macros_2.10/jars/breeze-macros_2.10-0.3.jar:/root/.sbt/boot/scala-2.10.3/lib/scala-reflect.jar:/root/.ivy2/cache/com.thoughtworks.paranamer/paranamer/jars/paranamer-2.2.jar:/root/.ivy2/cache/com.github.fommil.netlib/core/jars/core-1.1.2.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1.jar:/root/.ivy2/cache/net.sourceforge.f2j/arpack_combined_all/jars/arpack_combined_all-0.1-javadoc.jar:/root/.ivy2/cache/net.sf.opencsv/opencsv/jars/opencsv-2.3.jar:/root/.ivy2/cache/com.github.rwl/jtransforms/jars/jtransforms-2.4.0.jar:/root/.ivy2/cache/junit/junit/jars/junit-4.8.2.jar:/root/.ivy2/cache/org.apache.commons/commons-math3/jars/commons-math3-3.2.jar:/root/.ivy2/cache/org.spire-math/spire_2.10/jars/spire_2.10-0.7.1.jar:/root/.ivy2/cache/org.spire-math/spire-macros_2.10/jars/spire-macros_2.10-0.7.1.jar:/root/.ivy2/cache/com.typesafe/scalalogging-slf4j_2.10/jars/scalalogging-slf4j_2.10-1.0.1.jar:/root/.ivy2/cache/org.slf4j/slf4j-api/jars/slf4j-api-1.7.2.jar:/root/.ivy2/cache/org.scalanlp/breeze-natives_2.10/jars/breeze-natives_2.10-0.7.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-osx-x86_64/jars/netlib-native_ref-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_ref-java/jars/native_ref-java-1.1.jar:/root/.ivy2/cache/com.github.fommil/jniloader/jars/jniloader-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-x86_64/jars/netlib-native_ref-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-i686/jars/netlib-native_ref-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-x86_64/jars/netlib-native_ref-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-win-i686/jars/netlib-native_ref-win-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_ref-linux-armhf/jars/netlib-native_ref-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-osx-x86_64/jars/netlib-native_system-osx-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/native_system-java/jars/native_system-java-1.1.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-x86_64/jars/netlib-native_system-linux-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-i686/jars/netlib-native_system-linux-i686-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-linux-armhf/jars/netlib-native_system-linux-armhf-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-x86_64/jars/netlib-native_system-win-x86_64-1.1-natives.jar:/root/.ivy2/cache/com.github.fommil.netlib/netlib-native_system-win-i686/jars/netlib-native_system-win-i686-1.1-natives.jar
::/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar"
"-Xms4096M" "-Xmx4096M"



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


First sample with Spark Streaming and three Time's?

2014-05-18 Thread Jacek Laskowski
Hi,

I'm quite new to Spark Streaming and developed the following
application to pass 4 strings, process them and shut down:

val conf = new SparkConf(false) // skip loading external settings
  .setMaster("local[1]") // run locally with one thread
  .setAppName("Spark Streaming with Scala") // name in Spark web UI
val ssc = new StreamingContext(conf, Seconds(5))
val stream: ReceiverInputDStream[String] = ssc.receiverStream(
  new Receiver[String](StorageLevel.MEMORY_ONLY_SER_2) {
def onStart() {
  println("[ACTIVATOR] onStart called")
  store("one")
  store("two")
  store("three")
  store("four")
  stop("No more data...receiver stopped")
}

def onStop() {
  println("[ACTIVATOR] onStop called")
}
  }
)
stream.count().map(cnt => "Received " + cnt + " events.").print()

ssc.start()
// ssc.awaitTermination(1000)
val stopSparkContext, stopGracefully = true
ssc.stop(stopSparkContext, stopGracefully)

I'm running it with `xsbt 'runMain StreamingApp'`, with xsbt and Spark
built from the latest sources.

What I noticed is that the app generates:

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(1, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 1 (take at
DStream.scala:593) finished in 0.245 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 4.829798 s
---
Time: 140044517 ms
---

14/05/18 22:32:55 INFO DAGScheduler: Completed ResultTask(3, 0)
14/05/18 22:32:55 INFO DAGScheduler: Stage 3 (take at
DStream.scala:593) finished in 0.022 s
14/05/18 22:32:55 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.194738 s
---
Time: 1400445175000 ms
---

14/05/18 22:33:00 INFO DAGScheduler: Completed ResultTask(5, 0)
14/05/18 22:33:00 INFO DAGScheduler: Stage 5 (take at
DStream.scala:593) finished in 0.014 s
14/05/18 22:33:00 INFO SparkContext: Job finished: take at
DStream.scala:593, took 0.319387 s
---
Time: 140044518 ms
---

Why are there three jobs finished? I would expect one since after
`store` the app immediately calls `stop`. Can I have a single job that
would process these 4 `store`s?

Jacek

-- 
Jacek Laskowski | http://blog.japila.pl
"Never discourage anyone who continually makes progress, no matter how
slow." Plato


Spark Shell stuck on standalone mode

2014-05-18 Thread Sidharth Kashyap
Hi,

I have configured a cluster with 10 slaves and one master.
The master web portal shows all the slaves and looks to be rightly configured.

I started the master node with the command:

MASTER=spark://:38955 $SPARK_HOME/bin/spark-shell

This brings up the REPL, though with the following message:

"14/05/18 23:34:39 ERROR AppClient$ClientActor: Master removed our
application: FAILED; stopping client"

scala> val textFile = sc.textFile("CHANGES.txt")
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :12

scala> textFile.count()

and the control never comes out of the REPL, as shown in the attachment.
Am I doing something wrong?
Please help.

Regards,
Sid

Re: Text file and shuffle

2014-05-18 Thread Han JU
I think the shuffle is unavoidable given that the input partitions
(probably Hadoop input splits in your case) are not arranged in the way a
cogroup job needs. But maybe you can try:

  1) co-partition your data for the cogroup:

val par = new HashPartitioner(128)
val big = sc.textFile(...).map(...).partitionBy(par)
val small = sc.textFile(...).map(...).partitionBy(par)
...

  See discussion in
https://groups.google.com/forum/#!topic/spark-users/gUyCSoFo5RI

  2) since you have 25GB mem on each node, you can use the broadcast
variable in spark to distribute the smaller dataset on each node and do
cogroup with it.
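
A rough sketch of option 2 (assuming an existing SparkContext named sc,
made-up HDFS paths, and a simple comma-separated record format):

import org.apache.spark.SparkContext._

// Pull the smaller dataset onto the driver, broadcast it to every node, and
// do a map-side join against it, so the big dataset is never shuffled.
val small = sc.textFile("hdfs:///path/to/small")            // hypothetical path
  .map { line => val Array(k, v) = line.split(",", 2); (k, v) }
  .groupByKey()
  .collectAsMap()
val smallBc = sc.broadcast(small)

val joined = sc.textFile("hdfs:///path/to/big")             // hypothetical path
  .map { line => val Array(k, v) = line.split(",", 2); (k, v) }
  .flatMap { case (k, v) =>
    smallBc.value.get(k).map(smallVals => (k, (v, smallVals)))
  }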



2014-05-18 4:41 GMT+02:00 Puneet Lakhina :

> Hi,
>
> I'm new to spark and I wanted to understand a few things conceptually so
> that I can optimize my spark job. I have a large text file (~14G, 200k
> lines). This file is available on each worker node of my spark cluster. The
> job I run calls sc.textFile(...).flatmap(...) . The function that I pass
> into flat map splits up each line from the file into a key and value. Now I
> have another text file which is smaller in size(~1.5G) but has a lot more
> lines because it has more than one value per key spread across multiple
> lines. . I call the same textFile and flatmap functions on they other file
> and then call groupByKey to have all values for a key available as a list.
>
> Having done this I then cogroup these 2 RDDs. I have the following
> questions
>
> 1. Is this sequence of steps the best way to achieve what I want, I.e a
> join across the 2 data sets?
>
> 2. I have a 8 node (25 Gb memory each) . The large file flatmap spawns
> about 400 odd tasks whereas the small file flatmap only spawns about 30 odd
> tasks. The large file's flatmap takes about 2-3 mins and during this time
> it seems to do about 3G of shuffle write. I want to understand if this
> shuffle write is something I can avoid. From what I have read, the shuffle
> write is a disk write. Is that correct? Also is the reason for the shuffle
> write the fact that the partitioner for flatmap ends up having to
> redistribute the data across the cluster?
>
> Please let me know if I haven't provided enough information. I'm new to
> spark so if you see anything fundamental that I don't understand please
> feel free to just point me to a link that provides some detailed
> information.
>
> Thanks,
> Puneet




-- 
*JU Han*

Data Engineer @ Botify.com

+33 061960


Re: unsubscribe

2014-05-18 Thread Andrew Ash
Hi Shangyu (and everyone else looking to unsubscribe!),

If you'd like to get off this mailing list, please send an email to user
*-unsubscribe*@spark.apache.org, not the regular user@spark.apache.org list.

How to use the Apache mailing list infrastructure is documented here:
https://www.apache.org/foundation/mailinglists.html
And the Spark User list specifically can be found here:
http://mail-archives.apache.org/mod_mbox/spark-user/

Thanks!
Andrew


On Sun, May 18, 2014 at 12:39 PM, Shangyu Luo  wrote:

> Thanks!
>


Re: IllegelAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
I tried hbase-0.96.2/0.98.1/0.98.2

HDFS version is 2.3 

-- 
Nan Zhu

On Sunday, May 18, 2014 at 4:18 PM, Nan Zhu wrote: 
> Hi, all 
> 
> I tried to write data to HBase in a Spark-1.0 rc8  application, 
> 
> the application is terminated due to the java.lang.IllegalAccessError, Hbase 
> shell works fine, and the same application works with a standalone Hbase 
> deployment
> 
> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
> at 
> org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
> at 
> org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
> at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
> at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
> at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
> at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
> at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
> at 
> org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
> at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
> at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
> at 
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
> at 
> org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> 
> 
> Can anyone give some hint to the issue?
> 
> Best, 
> 
> -- 
> Nan Zhu
> 



IllegelAccessError when writing to HBase?

2014-05-18 Thread Nan Zhu
Hi all,

I tried to write data to HBase in a Spark 1.0 rc8 application.

The application is terminated due to a java.lang.IllegalAccessError. The HBase
shell works fine, and the same application works with a standalone HBase
deployment.
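
For context, the write path is a saveAsNewAPIHadoopDataset through
TableOutputFormat; a trimmed-down sketch of it (the table name, column family
and data below are placeholders, not the actual job):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "test_table")  // placeholder table
val job = new Job(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Turn each record into (row key, Put) and write through TableOutputFormat;
// closing its record writer flushes the puts, which is where the trace below
// blows up.
sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
  .map { case (key, value) =>
    val put = new Put(Bytes.toBytes(key))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
  }
  .saveAsNewAPIHadoopDataset(job.getConfiguration)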

java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString 
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:930)
at 
org.apache.hadoop.hbase.protobuf.RequestConverter.buildGetRowOrBeforeRequest(RequestConverter.java:133)
at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1466)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1236)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1110)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1067)
at 
org.apache.hadoop.hbase.client.AsyncProcess.findDestLocation(AsyncProcess.java:356)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:301)
at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:955)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1239)
at org.apache.hadoop.hbase.client.HTable.close(HTable.java:1276)
at 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:112)
at 
org.apache.spark.rdd.PairRDDFunctions.org$apache$spark$rdd$PairRDDFunctions$$writeShard$1(PairRDDFunctions.scala:720)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:730)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Can anyone give some hint to the issue?

Best, 

-- 
Nan Zhu



unsubscribe

2014-05-18 Thread Shangyu Luo
Thanks!


Re: Passing runtime config to workers?

2014-05-18 Thread Robert James
I see - I didn't realize that scope would work like that.  Are you
saying that any variable that is in scope of the lambda passed to map
will be automagically propagated to all workers? What if it's not
explicitly referenced in the map, only used by it?  E.g.:

def main:
  settings.setSettings
  rdd.map(x => F.f(x))

object F {
  def f(...)...
  val settings:...
}

F.f accesses F.settings, like a Singleton.  The master sets F.settings
before using F.f in a map.  Will all workers have the same F.settings
as seen by F.f?



On 5/16/14, DB Tsai  wrote:
> Since the env variables in the driver will not be passed into workers, the
> easiest way is to refer to the variables directly in workers from the
> driver.
>
> For example,
>
> val variableYouWantToUse = System.getenv("something defined in env")
>
> rdd.map(
> you can access `variableYouWantToUse` here
> )
>
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Fri, May 16, 2014 at 1:59 PM, Robert James
> wrote:
>
>> What is a good way to pass config variables to workers?
>>
>> I've tried setting them in environment variables via spark-env.sh, but,
>> as
>> far as I can tell, the environment variables set there don't appear in
>> workers' environments.  If I want to be able to configure all workers,
>> what's a good way to do it?  For example, I want to tell all workers:
>> USE_ALGO_A or USE_ALGO_B - but I don't want to recompile.
>>
>
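
For illustration, a minimal Scala sketch of the pattern described above: read
the value once on the driver and let the closure capture it.  It assumes an
existing SparkContext sc and rdd; the env var name and the
runAlgoA/runAlgoB/process helpers are placeholders.

// driver side: read once; the local val is serialized with the map closure
val algo = sys.env.getOrElse("USE_ALGO", "A")

val result = rdd.map { x =>
  // `algo` was captured when the closure was created on the driver
  if (algo == "A") runAlgoA(x) else runAlgoB(x)
}

// for larger read-only config, a broadcast variable is another option
val settingsBc = sc.broadcast(Map("algo" -> algo))
rdd.map(x => process(x, settingsBc.value))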


Re: File list read into single RDD

2014-05-18 Thread Andrew Ash
Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's
FileInputFormat.setInputPaths() call.  There is no separate Spark storage
system; Spark just delegates to Hadoop for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you
can use Spark on data in S3 using s3:/// just the same as you would with HDFS.
See Apache's documentation on S3 for more details.

As far as interacting with a FileSystem (HDFS or other) to list files, delete
files, navigate paths, etc. from your driver program, you should be able to
just instantiate a FileSystem object and use the normal Hadoop APIs from
there.  The Apache getting-started docs on reading/writing from Hadoop DFS
should work the same for non-HDFS examples too.
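
For example, a rough sketch (untested; it assumes an existing SparkContext sc,
a plain Hadoop Configuration, and placeholder URI and glob pattern):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://namenode:8020"), hadoopConf)  // placeholder URI

// list the files matching a glob, then hand them to sc.textFile() as one
// comma-separated string
val matched = fs.globStatus(new Path("/data/2014/*/part-*"))          // placeholder pattern
  .map(_.getPath.toString)
val rdd = sc.textFile(matched.mkString(","))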

I do think we could use a little "recipe" in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel  wrote:

> Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI?
> Since Spark supports several FS schemes I’m unclear about how much to
> assume about using the Hadoop file system APIs and conventions. Concretely,
> if I pass in a pattern with an HTTPS file system, will the pattern work?
>
> How does Spark implement its storage system? This seems to be an
> abstraction level beyond what is available in HDFS. In order to preserve
> that flexibility what APIs should I be using? It would be easy to say, HDFS
> only and use HDFS APIs but that would seem to limit things. Especially
> where you would like to read from one cluster and write to another. This is
> not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.
>
> If I can stick to passing URIs to sc.textFile() I’m ok but if I need to
> examine the structure of the file system, I’m unclear how I should do it
> without sacrificing Spark’s flexibility.
>
> On Apr 29, 2014, at 12:55 AM, Christophe Préaud <
> christophe.pre...@kelkoo.com> wrote:
>
>  Hi,
>
> You can also use any path pattern as defined here:
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> e.g.:
>
> sc.textFile('{/path/to/file1,/path/to/file2}')
>
> Christophe.
>
> On 29/04/2014 05:07, Nicholas Chammas wrote:
>
> Not that I know of. We were discussing it on another thread and it came
> up.
>
>  I think if you look up the Hadoop FileInputFormat API (which Spark uses)
> you'll see it mentioned there in the docs.
>
>
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
>
>  But that's not obvious.
>
>  Nick
>
> On Monday, April 28, 2014, Pat Ferrel wrote:
>
>> Perfect.
>>
>>  BTW just so I know where to look next time, was that in some docs?
>>
>>   On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>  Yep, as I just found out, you can also provide sc.textFile() with a
>> comma-delimited string of all the files you want to load.
>>
>> For example:
>>
>> sc.textFile('/path/to/file1,/path/to/file2')
>>
>> So once you have your list of files, concatenate their paths like that
>> and pass the single string to textFile().
>>
>> Nick
>>
>>
>> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel  wrote:
>>
>>> sc.textFile(URI) supports reading multiple files in parallel but only
>>> with a wildcard. I need to walk a dir tree, match a regex to create a list
>>> of files, then I’d like to read them into a single RDD in parallel. I
>>> understand these could go into separate RDDs then a union RDD can be
>>> created. Is there a way to create a single RDD from a URI list?
>>
>>
>>
>>
>


Re: breeze DGEMM slow in spark

2014-05-18 Thread Xiangrui Meng
Can you attach the slave classpath? -Xiangrui

On Sun, May 18, 2014 at 2:02 AM, wxhsdp  wrote:
> Hi, xiangrui
>
>   you said "It doesn't work if you put the netlib-native jar inside an
> assembly
>   jar. Try to mark it "provided" in the dependencies, and use --jars to
>   include them with spark-submit. -Xiangrui"
>
>   I'm not using an assembly jar that contains everything; I also marked the
> breeze dependencies as provided, and manually downloaded the jars and added
> them to the slave classpath, but it still doesn't work :(
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5979.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: File list read into single RDD

2014-05-18 Thread Pat Ferrel
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since
Spark supports several FS schemes I’m unclear about how much to assume about
using the Hadoop file system APIs and conventions. Concretely, if I pass in a
pattern with an HTTPS file system, will the pattern work?

How does Spark implement its storage system? This seems to be an abstraction 
level beyond what is available in HDFS. In order to preserve that flexibility 
what APIs should I be using? It would be easy to say, HDFS only and use HDFS 
APIs but that would seem to limit things. Especially where you would like to 
read from one cluster and write to another. This is not so easy to do inside 
the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m ok but if I need to examine 
the structure of the file system, I’m unclear how I should do it without 
sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud  
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
> Not that I know of. We were discussing it on another thread and it came up. 
> 
> I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
> you'll see it mentioned there in the docs. 
> 
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
> 
> But that's not obvious.
> 
> Nick
> 
> On Monday, April 28, 2014, Pat Ferrel wrote:
> Perfect. 
> 
> BTW just so I know where to look next time, was that in some docs?
> 
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas  
> wrote:
> 
> Yep, as I just found out, you can also provide 
> sc.textFile() with a comma-delimited string of all the files you want to load.
> 
> For example:
> 
> sc.textFile('/path/to/file1,/path/to/file2')
> So once you have your list of files, concatenate their paths like that and 
> pass the single string to 
> textFile().
> 
> Nick
> 
> 
> 
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel  wrote:
> sc.textFile(URI) supports reading multiple files in parallel but only with a 
> wildcard. I need to walk a dir tree, match a regex to create a list of files, 
> then I’d like to read them into a single RDD in parallel. I understand these 
> could go into separate RDDs then a union RDD can be created. Is there a way 
> to create a single RDD from a URI list?
> 
> 





Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread Daniel Mahler
Hi Matei,

Thanks for the suggestions.
Is the number of partitions set by calling 'myrrd.partitionBy(new
HashPartitioner(N))'?
Is there some heuristic formula for choosing a good number of partitions?

thanks
Daniel




On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu  wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce
> with
> > roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> >
> >
> > -
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
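
To make the partition-count knobs above concrete, a minimal sketch; myRdd, the
keyOf function, and the partition count of 2000 are placeholders (Spark
0.9/1.0-era APIs assumed):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD functions

val conf = new SparkConf()
  .setAppName("reduce-example")
  .set("spark.shuffle.consolidateFiles", "true")   // produce fewer shuffle files
val sc = new SparkContext(conf)

// pass an explicit number of reduce partitions instead of relying on the default
val counts = myRdd.map(x => (keyOf(x), 1L)).reduceByKey(_ + _, 2000)

// partitionBy with a HashPartitioner pre-partitions a pair RDD for later stages
val prePartitioned = myRdd.map(x => (keyOf(x), x)).partitionBy(new HashPartitioner(2000))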


Re: Configuring Spark for reduceByKey on on massive data sets

2014-05-18 Thread lukas nalezenec
Hi
Try using *reduceByKeyLocally*.
Regards
Lukas Nalezenec


On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu  wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce
> with
> > roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> >
> >
> > -
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
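
A minimal sketch of that call, assuming a placeholder pair RDD named pairs of
(String, Long).  Note that reduceByKeyLocally returns an ordinary Map to the
driver, so it only fits when the set of distinct keys is small:

import org.apache.spark.SparkContext._   // pair RDD functions

// reduces values per key within each partition, then merges the partial maps
// on the driver rather than producing another RDD
val localCounts: scala.collection.Map[String, Long] = pairs.reduceByKeyLocally(_ + _)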


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, xiangrui

  you said "It doesn't work if you put the netlib-native jar inside an
assembly 
  jar. Try to mark it "provided" in the dependencies, and use --jars to 
  include them with spark-submit. -Xiangrui"

  I'm not using an assembly jar that contains everything; I also marked the
breeze dependencies as provided, and manually downloaded the jars and added
them to the slave classpath, but it still doesn't work :(



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5979.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
In case 1, the breeze dependency in the sbt build file automatically downloads
the jars and adds them to the classpath.

In the Spark case, I manually download all the jars and add them to the Spark
classpath.

Why did case 1 succeed while case 2 failed? Am I missing something?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: breeze DGEMM slow in spark

2014-05-18 Thread wxhsdp
Hi, Xiangrui
  I checked the stderr on the worker node; yes, it failed to load the
implementation from:
  com.github.fommil.netlib.NativeSystemBLAS...

  What do you mean by "include breeze-natives or netlib:all"?

  Things I've already done:
  1. added the breeze and breeze-natives dependencies in the sbt build file
  2. downloaded all the breeze jars to the slaves
  3. added the jars to the classpath on each slave
  4. ln -s libopenblas_nehalemp-r0.2.9.rc2.so libblas.so.3 and added it to
     LD_LIBRARY_PATH on each slave
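
  (For reference, a rough sketch of step 1 combined with the "provided" plus
--jars suggestion; the artifact names and version numbers here are
illustrative and may need adjusting for your setup:)

// build.sbt sketch: breeze marked provided so it is not folded into the app jar
libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze"         % "0.7" % "provided",
  "org.scalanlp" %% "breeze-natives" % "0.7" % "provided"
)

// then ship the jars explicitly when submitting, e.g.:
// spark-submit --class MyApp \
//   --jars breeze_2.10-0.7.jar,breeze-natives_2.10-0.7.jar,<netlib native jars> \
//   myapp.jar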

  thank you for your help



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/breeze-DGEMM-slow-in-spark-tp5950p5977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.