Hi Krishna,

Also, the default optimizer, SGD, converges really slowly. If you are
willing to write Scala code, there is a full working example of training
logistic regression with L-BFGS (a quasi-Newton method) in Scala. It
converges much faster than SGD.

See http://spark.apache.org/docs/latest/mllib-optimization.html for details.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
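A condensed Scala sketch in the spirit of the example on that page (the
parameter values here are illustrative, and parsedData stands in for an
RDD[LabeledPoint] like the one in Krishna's script):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// runLBFGS takes (label, features) pairs; appendBias folds the intercept
// term into the feature vector.
val training = parsedData.map(p => (p.label, MLUtils.appendBias(p.features))).cache()

val numFeatures = 4
val (weightsWithIntercept, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,    // numCorrections: history length for the quasi-Newton update
  1e-4,  // convergenceTol
  20,    // maxNumIterations
  0.1,   // regParam
  Vectors.dense(new Array[Double](numFeatures + 1)))  // zero initial weights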
On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Krishna,
>
> Specifying executor memory in local mode has no effect, because all of
> the threads run inside the same JVM. You can either try
> --driver-memory 60g or start a standalone server.
>
> Best,
> Xiangrui
>
> On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> take that long, even on a single executor. Besides what Matei
>> suggested, could you also verify the executor memory at
>> http://localhost:4040 in the Executors tab? It is very likely that the
>> executors do not have enough memory. In that case, caching may be
>> slower than reading directly from disk. -Xiangrui
>>
>> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
>>> parallel, so they get processed by a single task.
>>>
>>> It may also be worth looking at the application UI (http://localhost:4040)
>>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
>>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
>>> and 2) how many parallel tasks run in each iteration.
>>>
>>> Matei
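If the file does turn out to be gzipped, a minimal workaround sketch
(assuming sc is the SparkContext; the path and partition count are made up)
is to repartition right after loading so downstream stages run in parallel:

// A .gz text file comes in as a single partition; repartition so that
// subsequent stages can use all 8 cores. Path and count are illustrative.
val raw = sc.textFile("/path/to/data.csv.gz").repartition(8)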
>>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>>
>>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
>>> running only 10 iterations.
>>>
>>> The MLlib version of logistic regression doesn't seem to use all the cores
>>> on my machine.
>>>
>>> Regards,
>>> Krishna
>>>
>>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>> Are you using the logistic_regression.py in examples/src/main/python or
>>>> examples/src/main/python/mllib? The first one is an example of writing
>>>> logistic regression by hand and won't be as efficient as the MLlib one. I
>>>> suggest trying the MLlib one.
>>>>
>>>> You may also want to check how many iterations it runs; by default I
>>>> think it runs 100, which may be more than you need.
>>>>
>>>> Matei
>>>>
>>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > I am new to Spark, and I am trying to run LogisticRegression (with SGD)
>>>> > using MLlib on a beefy single machine with about 128GB RAM. The dataset
>>>> > has about 80M rows with only 4 features, so it barely occupies 2GB on disk.
>>>> >
>>>> > I am running the code using all 8 cores with 20G memory using
>>>> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>>>> >
>>>> > It seems to take about 3.5 hours without caching and over 5 hours with
>>>> > caching.
>>>> >
>>>> > What is the recommended use for Spark on a beefy single machine?
>>>> >
>>>> > Any suggestions will help!
>>>> >
>>>> > Regards,
>>>> > Krishna
>>>> >
>>>> > Code sample:
>>>> > ---------------------------------------------------------------------------------------------------------------------
>>>> > import sys
>>>> > import time
>>>> >
>>>> > from pyspark import SparkContext
>>>> > from pyspark.mllib.classification import LogisticRegressionWithSGD
>>>> > from pyspark.mllib.regression import LabeledPoint
>>>> >
>>>> > sc = SparkContext(appName="LogisticRegressionWithSGD")
>>>> >
>>>> > # Dataset
>>>> > d = sys.argv[1]
>>>> > data = sc.textFile(d)
>>>> >
>>>> > # Load and parse the data
>>>> > # ----------------------------------------------------------------------------------------------------------
>>>> > def parsePoint(line):
>>>> >     values = [float(x) for x in line.split(',')]
>>>> >     return LabeledPoint(values[0], values[1:])
>>>> >
>>>> > parsedData = data.map(parsePoint).cache()
>>>> > results = {}
>>>> >
>>>> > # Spark
>>>> > # ----------------------------------------------------------------------------------------------------------
>>>> > start_time = time.time()
>>>> >
>>>> > # Build the model
>>>> > niters = 10
>>>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>>>> >
>>>> > # Evaluate the model on training data
>>>> > labelsAndPreds = parsedData.map(
>>>> >     lambda p: (p.label, spark_model.predict(p.features)))
>>>> > trainErr = (labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count()
>>>> >             / float(parsedData.count()))
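For what it's worth, combining Xiangrui's advice above with Krishna's launch
line, the local-mode invocation would presumably become
spark-submit --driver-memory 60g --master local[8] logistic_regression.py,
since --executor-memory is ignored when all the threads run in one JVM.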