Re: ALS failure with size > Integer.MAX_VALUE

2014-12-15 Thread Xiangrui Meng
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui




Re: ALS failure with size > Integer.MAX_VALUE

2014-12-15 Thread Bharath Ravi Kumar
Ok. We'll try using it in a test cluster running 1.2.





Re: ALS failure with size > Integer.MAX_VALUE

2014-12-14 Thread Bharath Ravi Kumar
Hi Xiangrui,

The block size limit was encountered even with a reduced number of item
blocks, as you had expected. I'm wondering if I could try the new
implementation as a standalone library against a 1.1 deployment. Does it
have dependencies on any core APIs in the current master?

Thanks,
Bharath





Re: ALS failure with size > Integer.MAX_VALUE

2014-12-03 Thread Bharath Ravi Kumar
Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And
yes, I've been following the JIRA for the new ALS implementation. I'll try
it out when it's ready for testing.




Re: ALS failure with size > Integer.MAX_VALUE

2014-12-02 Thread Xiangrui Meng
Hi Bharath,

You can try setting a smaller number of item blocks in this case. 1200 is
definitely too large for ALS. Please try 30 or even smaller. I'm not sure
whether this will solve the problem, because you have 100 items connected
to 10^8 users. There is a JIRA for this issue:

https://issues.apache.org/jira/browse/SPARK-3735

which I will try to implement in 1.3. I'll ping you when it is ready.
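
For reference, a minimal sketch of what lowering the block count looks like
against the mllib 1.1 API (this overload uses one count for both user and
product blocks; the rank matches the setup described in the thread, while
the iteration count, lambda, and block count below are only illustrative
values to tune):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainWithFewerBlocks(ratings: RDD[Rating]) = {
  val rank = 10            // as described in the original post
  val iterations = 10      // illustrative
  val lambda = 0.01        // illustrative
  val blocks = 30          // small, explicit block count as suggested above
  ALS.train(ratings, rank, iterations, lambda, blocks)
}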

Best,
Xiangrui




Re: ALS failure with size > Integer.MAX_VALUE

2014-12-01 Thread Bharath Ravi Kumar
Yes, the issue appears to be due to the 2GB block size limitation. I am
hence looking for (user, product) block sizing suggestions to work around
it.

On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote:

 (It won't be that, since you see that the error occurs when reading a
 block from disk. I think this is an instance of the 2GB block size
 limitation.)



Re: ALS failure with size > Integer.MAX_VALUE

2014-11-29 Thread Ganelin, Ilya
Hi Bharath – I’m unsure if this is your problem, but the
MatrixFactorizationModel in MLlib, which is the underlying component for ALS,
expects your User/Product fields to be integers. Specifically, the input to
ALS is an RDD[Rating], and Rating is an (Int, Int, Double). I am wondering if
perhaps one of your identifiers exceeds MAX_INT; could you write a quick check
for that?
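
A quick check along those lines might look like the following sketch
(assumptions: a spark-shell sc is available, and the raw input is
comma-separated userId,itemId,rating text; the path and delimiter are
placeholders to adapt to your format):

// Count raw records whose user or item identifier does not fit in an Int.
val raw = sc.textFile("hdfs:///path/to/ratings")   // placeholder path
val overflowing = raw.map(_.split(","))
  .filter(f => f(0).trim.toLong > Int.MaxValue || f(1).trim.toLong > Int.MaxValue)
  .count()
println(s"records with an identifier exceeding Int.MaxValue: $overflowing")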

I have been running a very similar use case to yours (with more constrained
hardware resources); I haven’t seen this exact problem, but I’m sure we’ve
seen similar issues. Please let me know if you have other questions.



Re: ALS failure with size > Integer.MAX_VALUE

2014-11-28 Thread Bharath Ravi Kumar
Any suggestions to address the described problem? In particular, given the
skewed degree of some of the item nodes in the graph, I believe it should be
possible to define better block sizes to reflect that fact, but I am unsure
how to arrive at those sizes.

Thanks,
Bharath




ALS failure with size > Integer.MAX_VALUE

2014-11-27 Thread Bharath Ravi Kumar
We're training a recommender with ALS in mllib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 billion (~30GB of data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data blocks = number of item data blocks}. The number of user/item blocks
was varied between 50 and 1200. Irrespective of the block count (e.g. at
1200 blocks each), there are at least a couple of tasks that end up shuffle
reading > 9.7G each in the aggregate stage (ALS.scala:337) and failing with
the following exception:

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:108)
at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:124)
at
org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:332)
at
org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)




As for the data, on the user side, the degree of a node in the connectivity
graph is relatively small. However, on the item side, 3.8K out of the 4.5K
items are connected to 10^5 users each on average, with 100 items being
connected to nearly 10^8 users. The rest of the items are connected to fewer
than 10^5 users. With such a skew in the connectivity graph, I'm unsure
whether additional memory or variation in the block sizes would help
(considering my limited understanding of the implementation in mllib). Any
suggestion to address the problem?
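
For a rough back-of-envelope sense of scale (the ~16 bytes per
(user, item, rating) entry below is an assumed estimate that ignores
serialization overhead):

val bytesPerRating = 16L             // assumed: two Ints + one Double, unserialized
val usersPerHotItem = 100000000L     // ~10^8, per the skew described above
val perItemGB = bytesPerRating * usersPerHotItem / 1e9
println(f"$perItemGB%.1f GB for a single hot item's ratings")   // ~1.6 GB
// A block that ends up holding even two such items approaches the 2GB
// mapped-buffer limit from the stack trace above, regardless of block count.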


The test is being run on a standalone cluster of 3 hosts, each with 100G
RAM and 24 cores dedicated to the application. The additional configs I set
for the shuffle and for task failure reduction are as follows:

spark.core.connection.ack.wait.timeout=600
spark.shuffle.consolidateFiles=true
spark.shuffle.manager=SORT
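
(These can be set in spark-defaults.conf, passed with spark-submit --conf,
or set programmatically; a sketch of the SparkConf equivalent is below.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.shuffle.consolidateFiles", "true")
  .set("spark.shuffle.manager", "SORT")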


The job execution summary is as follows:

Active Stages:

Stage Id 2, aggregate at ALS.scala:337, duration 55 min, Tasks 1197/1200
(3 failed), Shuffle Read: 141.6 GB

Completed Stages (5):

Stage Id  Description                                       Duration  Tasks: Succeeded/Total  Input    Shuffle Read  Shuffle Write
6         org.apache.spark.rdd.RDD.flatMap(RDD.scala:277)   12 min    1200/1200               29.9 GB  1668.4 MB     186.8 GB
5         mapPartitionsWithIndex at ALS.scala:250
3         map at ALS.scala:231
0         aggregate at ALS.scala:337
1         map at ALS.scala:228


Thanks,
Bharath