t the SVD of the input matrix to the first; EOF is another name for
> PCA).
>
> This takes about 30 minutes to compute the top 20 PCs of a 46.7K-by-6.3M
> dense matrix of doubles (~2 TB), with most of the time spent on the
> distributed matrix-vector multiplies.
>
> Best,
> Al
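The matrix-vector multiplies dominate because iterative eigensolvers touch the matrix only through products A·v. A toy, pure-Python power iteration for the top principal component (all data made up for illustration; a sketch of the idea, not the distributed implementation):

```python
# Toy power iteration for the top principal component. It touches the
# covariance matrix only through matrix-vector products, which is why the
# distributed multiplies dominate the runtime at scale. Data is illustrative.

def top_pc(points, iters=100):
    n = len(points)
    d = len(points[0])
    # mean-center the data
    means = [sum(p[j] for p in points) / n for j in range(d)]
    x = [[p[j] - means[j] for j in range(d)] for p in points]
    # d-by-d covariance matrix
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # the only operation on the matrix: one matrix-vector product per step
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# points lie almost exactly along the x-axis, so the top PC should too
pc = top_pc([[2.0, 0.1], [-2.0, -0.1], [1.0, 0.05], [-1.0, -0.05]])
```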
Any suggestion/opinion?
On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar" wrote:
> We're running PCA (selecting 100 principal components) on a dataset that
> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
> matrix in question is mostly sparse with ten
We're running PCA (selecting 100 principal components) on a dataset that
has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
matrix in question is mostly sparse with tens of columns populated in most
rows, but a few rows with thousands of columns populated. We're running
spark on m
To be precise, the MesosExecutorBackend's Xms & Xmx equal
spark.executor.memory. So there's no question of expanding or contracting
the memory held by the executor.
On Sat, Oct 17, 2015 at 5:38 PM, Bharath Ravi Kumar
wrote:
> David, Tom,
>
> Thanks for the explan
t way to solve this is to use a higher
> level tool that can run your spark jobs through one mesos framework and
> then you can let spark distribute the resources more effectively.
>
> I hope that helps!
>
> Tom.
>
> On 17 Oct 2015, at 06:47, Bharath Ravi Kumar wrote:
>
>
Can someone respond if you're aware of the reason for such a memory
footprint? It seems unintuitive and hard to reason about.
Thanks,
Bharath
On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar
wrote:
> Resending since user@mesos bounced earlier. My apologies.
>
> On Thu, Oct
Resending since user@mesos bounced earlier. My apologies.
On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar
wrote:
> (Reviving this thread since I ran into similar issues...)
>
> I'm running two spark jobs (in mesos fine grained mode), each belonging to
> a different mesos
(Reviving this thread since I ran into similar issues...)
I'm running two Spark jobs (in Mesos fine-grained mode), each belonging to
a different mesos role, say low and high. The low:high mesos weights are
1:10. On expected lines, I see that the low priority job occupies cluster
resources to the m
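For reference, the role a Spark framework registers under is set per job, while the weights live on the Mesos master. A sketch of the relevant settings, with the role names taken from this thread and all values illustrative:

```
# spark-defaults.conf of the low-priority job (illustrative)
spark.mesos.role        low

# mesos-master flag establishing the 1:10 weighting (illustrative)
--weights="low=1,high=10"
```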
A follow-up: considering that Spark on Mesos is indeed important to
Databricks, its partners, and the community, fundamental issues like
SPARK-6284 shouldn't be languishing for this long. A Mesos cluster hosting
diverse (i.e. multi-tenant) workloads is a common scenario in production
for serious us
in
> http://spark.apache.org/docs/latest/running-on-yarn.html
> Then I can see exactly whats in the directory.
>
> Doug
>
> ps Sorry for the dup message Bharath and Todd, used wrong email address.
>
>
> > On Mar 19, 2015, at 1:19 AM, Bharath Ravi Kumar
> wrote:
3.2 but that was for a cloudera
> installation. I am not sure what the HDP version would be to put here.
>
> -Todd
>
> On Wed, Mar 18, 2015 at 12:49 AM, Bharath Ravi Kumar
> wrote:
>
>> Hi Todd,
>>
>> Yes, those entries were present in the conf under the same S
n your $SPARK_HOME/conf/spark-defaults.conf
> file?
>
> spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041
>
>
>
>
> On Tue, Mar 17, 2015 at 1:04 AM, Bharath Ravi Kumar
> wrote:
>
>> Still no luck
Still no luck running purpose-built 1.3 against HDP 2.2 after following all
the instructions. Anyone else faced this issue?
On Mon, Mar 16, 2015 at 8:53 PM, Bharath Ravi Kumar
wrote:
> Hi Todd,
>
> Thanks for the help. I'll try again after building a distribution with the
> 1.3
apache-spark-hdp/
>
> FWIW spark-1.3.0 appears to be working fine with HDP as well and steps 2a
> and 2b are not required.
>
> HTH
>
> -Todd
>
> On Mon, Mar 16, 2015 at 10:13 AM, Bharath Ravi Kumar
> wrote:
>
>> Hi,
>>
>> Trying to run spark ( 1.2.1
Hi,
Trying to run spark (1.2.1 built for hdp 2.2) against a yarn cluster
results in the AM failing to start with following error on stderr:
Error: Could not find or load main class
org.apache.spark.deploy.yarn.ExecutorLauncher
An application id was assigned to the job, but there were no logs.
Not
Ok. We'll try using it in a test cluster running 1.2.
On 16-Dec-2014 1:36 am, "Xiangrui Meng" wrote:
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui
On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar
wrote:
> Hi Xiangrui,
>
> The block size limit w
s,
Bharath
On Wed, Dec 3, 2014 at 10:10 PM, Bharath Ravi Kumar
wrote:
>
> Thanks Xiangrui. I'll try out setting a smaller number of item blocks. And
> yes, I've been following the JIRA for the new ALS implementation. I'll try
> it out when it's ready for tes
pache.org/jira/browse/SPARK-3735
>
> which I will try to implement in 1.3. I'll ping you when it is ready.
>
> Best,
> Xiangrui
>
> On Tue, Dec 2, 2014 at 10:40 AM, Bharath Ravi Kumar
> wrote:
> > Yes, the issue appears to be due to the 2GB block size limitation
check for that?
> >
> > I have been running a very similar use case to yours (with more
> constrained
> > hardware resources) and I haven’t seen this exact problem but I’m sure
> we’ve
> > seen similar issues. Please let me know if you have other questions
.
Thanks,
Bharath
On Fri, Nov 28, 2014 at 12:00 AM, Bharath Ravi Kumar
wrote:
> We're training a recommender with ALS in mllib 1.1 against a dataset of
> 150M users and 4.5K items, with the total number of training records being
> 1.2 Billion (~30GB data). The input data is spre
We're training a recommender with ALS in mllib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 Billion (~30GB data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data
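For context, ALS alternates closed-form least-squares solves for the user and item factors. A toy rank-1, pure-Python sketch (not the MLlib implementation, which additionally blocks users and items across partitions; the ratings matrix here is made up and exactly rank-1):

```python
# Toy rank-1 ALS: alternately fix one factor vector and solve the other in
# closed form. R is a made-up, exactly rank-1 ratings matrix, so the
# factorization reconstructs it exactly.
R = [[4.0, 2.0], [2.0, 1.0]]
u = [1.0, 1.0]  # user factors
v = [1.0, 1.0]  # item factors
for _ in range(5):
    # fix v, least-squares solve for each user factor
    vv = sum(c * c for c in v)
    u = [sum(R[i][j] * v[j] for j in range(2)) / vv for i in range(2)]
    # fix u, least-squares solve for each item factor
    uu = sum(c * c for c in u)
    v = [sum(R[i][j] * u[i] for i in range(2)) / uu for j in range(2)]
```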
approach. My bad.
On Mon, Nov 3, 2014 at 3:38 PM, Bharath Ravi Kumar
wrote:
> The result was no different with saveAsHadoopFile. In both cases, I can
> see that I've misinterpreted the API docs. I'll explore the API's a bit
> further for ways to save the iterable as chun
quot;save every element of the RDD as one line of text".
> It works like TextOutputFormat in Hadoop MapReduce since that's what
> it uses. So you are causing it to create one big string out of each
> Iterable this way.
>
> On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar
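The quoted point can be illustrated locally in plain Python: the output format writes each element's string form as one line, so an element that is an entire Iterable becomes one huge line unless it is flattened first (a sketch of the semantics, not Spark code):

```python
# One grouped value, i.e. one RDD element after groupBy (made-up data)
values = [1, 2, 3]

# saveAsTextFile-style behavior: the element's string form as a single line
one_big_line = str(values)

# flattening first yields one line per value instead
separate_lines = [str(v) for v in values]
```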
e (heap size too small), or a bug that results in an application
> attempting to create a huge array, for example, when the number of elements
> in the array are computed using an algorithm that computes an incorrect
> size.”
>
>
>
>
> On 2 Nov, 2014, at 12:25 pm, Bharath
Resurfacing the thread. Surely OOM shouldn't be the norm for a common
groupBy / sort use case in a framework that leads sorting benchmarks? Or is
there something fundamentally wrong in the usage?
On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar" wrote:
> Hi,
>
> I'm t
Minor clarification: I'm running spark 1.1.0 on JDK 1.8, Linux 64 bit.
On Sun, Nov 2, 2014 at 1:06 AM, Bharath Ravi Kumar
wrote:
> Hi,
>
> I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD
> of count ~ 100 million. The data size is 20GB and groupBy
Hi,
I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of
count ~ 100 million. The data size is 20GB and groupBy results in an RDD of
1061 keys with values being Iterable>. The job runs on 3 hosts in a standalone setup with each host's
executor having 100G RAM and 24 cores de
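One common remedy for this pattern is to sort by key and stream each group out, instead of materializing every value of a key in memory at once. A local, plain-Python illustration of that sort-then-stream idea (made-up records, tab-separated output assumed; not Spark code):

```python
from itertools import groupby

# Made-up (key, value) records standing in for the RDD's contents
records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

out_lines = []
# sorting brings each key's values together; groupby then streams each
# group lazily instead of building a dict of complete value lists
for key, group in groupby(sorted(records, key=lambda kv: kv[0]),
                          key=lambda kv: kv[0]):
    out_lines.append("%s\t%s" % (key, ",".join(str(v) for _, v in group)))
```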
Update: as expected, switching to Kryo merely delays the inevitable. Does
anyone have experience controlling memory consumption while processing
(e.g. writing out) imbalanced partitions?
On 09-Aug-2014 10:41 am, "Bharath Ravi Kumar" wrote:
> Our prototype application reads a 20GB
Our prototype application reads a 20GB dataset from HDFS (nearly 180
partitions), groups it by key, sorts by rank and write out to HDFS in that
order. The job runs against two nodes (16G, 24 cores per node available to
the job). I noticed that the execution plan results in two sortByKey
stages, fol
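One way to collapse separate group and sort stages is a single sort on a composite (key, rank) key, the classic secondary-sort pattern. A plain-Python sketch with made-up rows (this mirrors the idea, not the Spark execution plan itself):

```python
# Made-up (group key, rank, payload) rows
rows = [("k1", 3, "c"), ("k2", 1, "x"), ("k1", 1, "a"), ("k1", 2, "b")]

# One sort on the composite key gives rows grouped by key AND ordered by
# rank within each group, with no separate grouping pass
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
```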
finitely not done on the driver. It works as you say. Look
> at the source code for RDD.takeOrdered, which is what top calls.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
>
> On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ra
I'm looking to select the top n records (by rank) from a data set of a few
hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
entirely a driver-side operation in that all records are sorted in the
driver's memory. I prefer an approach where the records are sorted on the
cluster an
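For what it's worth, the distributed approach (each partition keeps only its local top n, and only those small lists are merged at the driver) can be sketched in plain Python with made-up data; this mirrors the idea, not Spark's implementation:

```python
import heapq

# Three made-up "partitions" of the dataset
partitions = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
n = 3

# each "executor" keeps only its local top-n...
local_tops = [heapq.nlargest(n, p) for p in partitions]

# ...and the "driver" merges just those small lists, never the full data
top_n = heapq.nlargest(n, [x for t in local_tops for x in t])
```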
PM, Bharath Ravi Kumar
wrote:
> That's right, I'm looking to depend on spark in general and change only
> the hadoop client deps. The spark master and slaves use the
> spark-1.0.1-bin-hadoop1 binaries from the downloads page. The relevant
> snippet from the app
ps to clarify what you are depending on? Building
> custom Spark and depending on it is a different thing from depending
> on plain Spark and changing its deps. I think you want the latter.
>
> On Fri, Jul 25, 2014 at 5:46 PM, Bharath Ravi Kumar
> wrote:
> > Thanks for responding
linked to your build in your app?
>
> On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar
> wrote:
> > Any suggestions to work around this issue ? The pre built spark binaries
> > don't appear to work against cdh as documented, unless there's a build
> >
Any suggestions to work around this issue? The pre-built Spark binaries
don't appear to work against CDH as documented, unless there's a build
issue, which seems unlikely.
On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar" wrote:
>
> I'm encountering a hadoop client p
I'm encountering a Hadoop client protocol mismatch trying to read from HDFS
(cdh3u5) using the pre-built Spark from the downloads page (linked under
"For Hadoop 1 (HDP1, CDH3)"). I've also followed the instructions at
http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
(i.e.
727 SUCCESS PROCESS_LOCAL slave2 2014/07/02
16:01:28 33 s 99 ms
Any pointers / diagnosis please?
On Thu, Jun 19, 2014 at 10:03 AM, Bharath Ravi Kumar
wrote:
> Thanks. I'll await the fix to re-run my test.
>
>
> On Thu, Jun 19, 2014 at 8:28 AM,
On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar
> wrote:
> > Couple more points:
> > 1)The inexplicable stalling of execution with large feature sets appears
> > similar to that reported with the news-20 dataset:
> >
> http://mail-archives.apache.org/mod_mbox/spark-user/2
a JavaPairRDD, Tuple2> is
unrelated to mllib.
Thanks,
Bharath
On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
wrote:
> Hi Xiangrui ,
>
> I'm using 1.0.0.
>
> Thanks,
> Bharath
> On 18-Jun-2014 1:43 am, "Xiangrui Meng" wrote:
>
>> Hi Bhar
Hi Xiangrui ,
I'm using 1.0.0.
Thanks,
Bharath
On 18-Jun-2014 1:43 am, "Xiangrui Meng" wrote:
> Hi Bharath,
>
> Thanks for posting the details! Which Spark version are you using?
>
> Best,
> Xiangrui
>
> On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kum
Hi,
(Apologies for the long mail, but it's necessary to provide sufficient
details considering the number of issues faced.)
I'm running into issues testing LogisticRegressionWithSGD on a two-node
cluster (each node with 24 cores and 16G available to slaves out of 24G on
the system). Here's a descrip
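For reference, LogisticRegressionWithSGD updates the weight vector with gradient steps on the logistic loss over (mini-)batches. A minimal single-step, pure-Python sketch with illustrative values (not the MLlib implementation):

```python
import math

# One SGD step on the logistic loss for a single labeled point; a minimal
# sketch of the per-batch update. All values are illustrative.
def sgd_step(w, x, y, lr):
    margin = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-margin))   # predicted probability of label 1
    grad = [(p - y) * xi for xi in x]     # gradient of the log loss
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# starting from zero weights, the step moves the weight toward the label
w = sgd_step([0.0], [1.0], 1.0, 0.5)
```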
(Trying to bubble up the issue again...)
Any insights (based on the enclosed logs) into why standalone client
invocation might fail while issuing jobs through the spark console
succeeded?
Thanks,
Bharath
On Thu, May 15, 2014 at 5:08 PM, Bharath Ravi Kumar wrote:
> Hi,
>
> I'
Hi,
I'm running the spark server with a single worker on a laptop using the
docker images. The spark shell examples run fine with this setup. However,
when a standalone Java client tries to run wordcount on a local file (1 MB
in size), the execution fails with the following error on the stdout of