RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense.

Regards,
Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Thursday, March 19, 2015 11:58 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn entries. When a
single row is dense, that can end up overwhelming a machine. You can push that limit
up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: it
scales linearly (and across the cluster) with the number of rows, but still
quadratically with the number of columns. I will be updating the documentation to
make this clear.
Best,
Reza
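
For a rough sense of that scale, a back-of-the-envelope sketch in Scala (illustrative
only; the ~24 bytes per output entry is an assumed in-memory cost, not a measurement):

// Upper triangle of the 56431 x 56431 output mentioned above, before any
// threshold-based pruning drops pairs.
val numCols: Long = 56431L
val maxPairs: Long = numCols * (numCols - 1) / 2   // ~1.6 billion column pairs
val approxGB: Double = maxPairs * 24 / 1e9         // two indices + one double, assumed
println(f"up to $maxPairs%,d pairs, roughly $approxGB%.0f GB if nothing is pruned")

Even the pruned result has to fit somewhere, which is why a single dense row can make
the output blow up on one machine.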

On Thu, Mar 19, 2015 at 3:46 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
Hi Reza,

Behavior:

• I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100.
Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522
(http://del2l379java.sapient.com:8088/proxy/application_1426267549766_0101/stages/stage?id=118&attempt=0)
with all workers running at 100% CPU. There is hardly any shuffle read/write
happening, and after some time, “ERROR YarnClientClusterScheduler: Lost
executor” starts showing (maybe because the nodes are running at 100% CPU).

• For thresholds of 200+ (tried up to 1000), it gave the error below, where the
oversampling value in the message differed for different thresholds (see the
sketch after this list).
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Oversampling should be greater than 1: 0.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilaritiesDIMSUM(RowMatrix.scala:511)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.columnSimilarities(RowMatrix.scala:492)
        at EntitySimilarity$.runSimilarity(EntitySimilarity.scala:241)
        at EntitySimilarity$.main(EntitySimilarity.scala:80)
        at EntitySimilarity.main(EntitySimilarity.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

• If I get rid of frequently occurring attributes and keep only those
attributes which occur in <2% of entities, then the job doesn't get stuck or fail.
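
One way to read the 200+ failures, assuming the Spark 1.2 RowMatrix derives its
oversampling factor as gamma = 10 * ln(numCols) / threshold (an assumption based on
the code path in the trace above, not a confirmed fact), is sketched below:

// Illustrative only: recomputing the assumed oversampling factor for the
// thresholds tried above, with numCols = 56431 from this matrix.
val numCols = 56431.0
val thresholds = Seq(0.1, 0.5, 5.0, 20.0, 100.0, 200.0, 1000.0)
thresholds.foreach { t =>
  val gamma = 10 * math.log(numCols) / t   // assumed formula from columnSimilarities
  println(f"threshold = $t%7.1f  gamma = $gamma%9.3f  passes (gamma > 1): ${gamma > 1.0}")
}
// 10 * ln(56431) ~= 109, so thresholds above ~109 would push gamma below 1 and
// trip "Oversampling should be greater than 1", while 100 would just pass.

If that reading is right, the 200+ failures are a parameter issue for this matrix
rather than a resource issue.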

Data & environment:

• RowMatrix of size 43345 X 56431

• In the matrix there are a couple of rows whose value is the same in up to
50% of the columns (frequently occurring attributes).

• I am running this on one of our dev clusters running CDH 5.3.0, with 5
data nodes (each with 4 cores and 16GB RAM).

My question – do you think this is a hardware sizing issue, and should we test it
on larger machines?

Regards,
Manish

From: Manish Gupta 8 [mailto:mgupt...@sapient.com]
Sent: Wednesday, March 18, 2015 11:20 PM
To: Reza Zadeh
Cc: user@spark.apache.org
Subject: RE: Column Similarity using DIMSUM

Hi Reza,

I have only tried thresholds in the range of 0 to 1. I was not aware that the
threshold can be set above 1.
Will try and update.

Thank You

- Manish

From: Reza Zadeh [mailto:r...@databricks.com]
Sent: Wednesday, March 18, 2015 10:55 PM
To: Manish Gupta 8
Cc: user@spark.apache.org
Subject: Re: Column Similarity using DIMSUM

Hi Manish,
Did you try calling columnSimilarities(threshold) with different threshold
values? Try threshold values of 0.1, 0.5, 1, 20, and higher.
Best,
Reza

On Wed, Mar 18, 2015 at 10:40 AM, Manish Gupta 8 <mgupt...@sapient.com> wrote:
Hi,

I am running Column Similarity (All Pairs Similarity using DIMSUM) in Spark on
a dataset that looks like (Entity, Attribute, Value), after transforming it into a
row-oriented dense matrix format (one row per Attribute, one column per Entity,
each cell holding a normalized value between 0 and 1).
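
A minimal sketch of that layout with toy data (the names, sizes, and the 0.1
threshold here are assumptions for illustration, not the actual job):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One row per Attribute, one column per Entity, cells normalized to [0, 1].
val sc = new SparkContext("local[*]", "dimsum-sketch")
val numEntities = 4
val attributeRows = Seq(
  Seq((0, 1.0), (2, 0.5)),                    // attribute seen for entities 0 and 2
  Seq((1, 0.8), (3, 0.3)),                    // attribute seen for entities 1 and 3
  Seq((0, 0.2), (1, 0.2), (2, 0.2), (3, 0.2)) // a frequently occurring attribute
)
val rowVectors = sc.parallelize(attributeRows.map(cells => Vectors.sparse(numEntities, cells)))

val mat  = new RowMatrix(rowVectors)
val sims = mat.columnSimilarities(0.1)        // DIMSUM with a similarity threshold
sims.entries.collect().foreach(println)       // (entityI, entityJ, approximate cosine)
sc.stop()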

It runs extremely fast in computing similarities between Entities in most cases,
but if there is even a single attribute which occurs frequently across the
entities (say in 30% of entities), the job falls apart. The whole job gets stuck
and the worker nodes run at 100% CPU without making any progress on the job
stage. If the dataset is very small (in the range of 1000 Entities X 500
attributes, some frequently occurring), the job finishes but takes too long (and
sometimes gives GC errors too).

If none of the attributes is frequently occurring (all < 2%), then the job runs in a
lightning-fast manner (even for 100 Entities X 1 attributes) and the results are
very accurate.

I am running Spark 1.2.0-cdh5.3.0 on an 11-node cluster, each node having 4 cores and
16GB of RAM.

My question is - is this behavior expected for datasets where some Attributes
frequently occur?

Thanks,
Manish Gupta
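
For completeness, a minimal sketch of the "keep only attributes occurring in <2% of
entities" filtering mentioned above (assumed names and representation; each attribute
row is its list of non-zero entity cells):

import org.apache.spark.rdd.RDD

// Drop frequently occurring attributes before building the matrix: keep only
// attribute rows that are non-zero in fewer than maxFraction of entity columns.
def dropFrequentAttributes(
    attributeRows: RDD[(String, Seq[(Int, Double)])],  // (attributeId, non-zero entity cells)
    numEntities: Long,
    maxFraction: Double = 0.02): RDD[(String, Seq[(Int, Double)])] = {
  attributeRows.filter { case (_, cells) =>
    cells.size.toDouble / numEntities < maxFraction
  }
}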
