My min support is low, and after the output fills all of my disk space I apply a filter on the results to keep only the item sets that interest me.
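A minimal sketch of pushing that filter ahead of the write, so the uninteresting item sets never reach disk; it assumes the FreqItemset API of the released 1.3.0 (an earlier 1.3.0-SNAPSHOT may still expose plain (itemset, count) pairs), and `interesting` is a hypothetical predicate standing in for whatever filter is applied:

    // `model` is an FPGrowthModel[String]; `interesting` is a hypothetical
    // predicate standing in for the filter described above.
    model.freqItemsets
      .filter(fi => interesting(fi.items))
      .map(fi => fi.items.mkString("[", ",", "]") + " -> " + fi.freq)
      .saveAsTextFile("hdfs:///path/to/filtered-output")  // placeholder path

Filtering before saveAsTextFile means only the item sets that pass the predicate are ever serialized and written, instead of writing everything and filtering afterwards.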
On Wed, 11 Mar 2015 1:58 pm Sean Owen <so...@cloudera.com> wrote:
> Have you looked at how big your output is? For example, if your min
> support is very low, you will output a massive volume of frequent item
> sets. If that's the case, then it may be expected that it's taking
> ages to write terabytes of data.
>
> On Wed, Mar 11, 2015 at 8:34 AM, Sean Barzilay <sesnbarzi...@gmail.com> wrote:
> > The program spends its time when I am writing the output to a text
> > file, and I am using 70 partitions.
> >
> > On Wed, 11 Mar 2015 9:55 am Sean Owen <so...@cloudera.com> wrote:
> >> I don't think there is enough information here. Where is the program
> >> spending its time? Where does it "stop"? How many partitions are
> >> there?
> >>
> >> On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> >> > You need to set spark.cores.max to a number, say 16, so that the
> >> > tasks will get distributed evenly across all 4 machines. Another
> >> > thing would be to set spark.default.parallelism, if you haven't
> >> > tried that already.
> >> >
> >> > Thanks
> >> > Best Regards
> >> >
> >> > On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay <sesnbarzi...@gmail.com> wrote:
> >> >> I am running on a cluster of 4 workers, each having between 16
> >> >> and 30 cores and 50 GB of RAM.
> >> >>
> >> >> On Wed, 11 Mar 2015 8:55 am Akhil Das <ak...@sigmoidanalytics.com> wrote:
> >> >>> Depending on your cluster setup (cores, memory), you need to
> >> >>> specify the parallelism / repartition the data.
> >> >>>
> >> >>> Thanks
> >> >>> Best Regards
> >> >>>
> >> >>> On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay <sesnbarzi...@gmail.com> wrote:
> >> >>>> Hi, I am currently using Spark 1.3.0-SNAPSHOT to run the
> >> >>>> FP-Growth algorithm from the MLlib library. When I try to run
> >> >>>> the algorithm over a large basket (over 1000 items), the
> >> >>>> program seems to never finish. Did anyone find a workaround
> >> >>>> for this problem?
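For anyone landing on this thread, here is a minimal sketch of the kind of FP-Growth job being discussed, in Scala against the MLlib API as released in Spark 1.3.0; the input path and parameter values are placeholders, not the poster's actual settings:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.fpm.FPGrowth
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-example"))

    // One basket per line, items separated by spaces (placeholder path).
    val transactions: RDD[Array[String]] =
      sc.textFile("hdfs:///path/to/baskets").map(_.trim.split(' '))

    // A low min support, as described above, and 70 partitions to match
    // the figure mentioned in the thread.
    val model = new FPGrowth()
      .setMinSupport(0.01)
      .setNumPartitions(70)
      .run(transactions)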
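And a sketch of the configuration Akhil suggests; spark.cores.max applies to standalone and Mesos deployments, and the numbers below are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fpgrowth-example")
      .set("spark.cores.max", "16")            // cap total cores used by the app
      .set("spark.default.parallelism", "64")  // default task count for shuffles
    val sc = new SparkContext(conf)

    // Alternatively, repartition the input RDD explicitly:
    // val transactions = rawTransactions.repartition(64)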