We are running Apache Spark 1.5.0 (latest code from the 1.5 branch). We train 2-3 LogisticRegression models in parallel (we'd actually love to run 10-20). They are not big at all, maybe 1-2 million rows per model.
The cluster itself and all executors look good: enough free memory and no exceptions or errors. However, I see very strange behavior inside the Spark driver. The allocated heap grows constantly, reaching 30 GB in about 1.5 hours, and then everything becomes extremely slow. We don't do any collect, and I really don't understand what is consuming all this memory. It looks like it's something inside LogisticRegression itself, but the only thing I see there is treeAggregate, which should not require that much memory to run. I also don't see any GC pauses, so it looks like the memory is still being held by something inside the driver. Any ideas?
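In case it helps anyone trying to reproduce this, here is roughly how the driver heap and GC time could be logged from inside the driver JVM. This is only a minimal sketch that uses the standard `java.lang.management` API. The class name `DriverHeapProbe` and its methods are hypothetical and not part of our actual job, and it only reports usage; it does not show which objects are holding the memory.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not from our job): samples heap usage and GC time
// of the JVM it runs in, e.g. the Spark driver.
public class DriverHeapProbe {

    /** Heap currently in use by this JVM, in MiB. */
    static long usedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / (1024L * 1024L);
    }

    /** Accumulated collection time across all garbage collectors, in ms. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** One-line summary suitable for periodic logging. */
    static String snapshot() {
        return "usedHeapMb=" + usedHeapMb() + " gcMillis=" + totalGcMillis();
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples one second apart; a real job would log this
        // from a background thread for the lifetime of the driver.
        for (int i = 0; i < 3; i++) {
            System.out.println(snapshot());
            Thread.sleep(1000);
        }
    }
}
```

If used heap keeps climbing while the GC time stays nearly flat, that would match what I am seeing: the memory is still reachable rather than garbage waiting to be collected.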