Hi Krishna,

Also, the default optimizer, SGD, converges really slowly. If you are
willing to write Scala code, there is a full working example of training
logistic regression with L-BFGS (a quasi-Newton method) in Scala. It
converges much faster than SGD.

See http://spark.apache.org/docs/latest/mllib-optimization.html for details.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
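A condensed Scala sketch in the spirit of the example on that page (the
parameter values here are illustrative, and parsedData stands in for an
RDD[LabeledPoint] like the one in Krishna's script):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

// runLBFGS takes (label, features) pairs; appendBias folds the intercept
// term into the feature vector.
val training = parsedData.map(p => (p.label, MLUtils.appendBias(p.features))).cache()

val numFeatures = 4
val (weightsWithIntercept, lossHistory) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,    // numCorrections: history length for the quasi-Newton update
  1e-4,  // convergenceTol
  20,    // maxNumIterations
  0.1,   // regParam
  Vectors.dense(new Array[Double](numFeatures + 1)))  // zero initial weights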
On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <men...@gmail.com> wrote:
> Hi Krishna,
>
> Specifying executor memory in local mode has no effect, because all of
> the threads run inside the same JVM. You can either try
> --driver-memory 60g or start a standalone server.
>
> Best,
> Xiangrui
>
> On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> take that long, even on a single executor. Besides what Matei
>> suggested, could you also verify the executor memory at
>> http://localhost:4040 in the Executors tab? It is very likely that the
>> executors do not have enough memory. In that case, caching may be
>> slower than reading directly from disk. -Xiangrui
>>
>> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
>>> parallel, so they get processed by a single task.
>>>
>>> It may also be worth looking at the application UI (http://localhost:4040)
>>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
>>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
>>> and 2) how many parallel tasks run in each iteration.
>>>
>>> Matei
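If the file does turn out to be gzipped, a minimal workaround sketch
(assuming sc is the SparkContext; the path and partition count are made up)
is to repartition right after loading so downstream stages run in parallel:

// A .gz text file comes in as a single partition; repartition so that
// subsequent stages can use all 8 cores. Path and count are illustrative.
val raw = sc.textFile("/path/to/data.csv.gz").repartition(8)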
>>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>>
>>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
>>> running only 10 iterations.
>>>
>>> The MLlib version of logistic regression doesn't seem to use all the cores
>>> on my machine.
>>>
>>> Regards,
>>> Krishna
>>>
>>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>> Are you using the logistic_regression.py in examples/src/main/python or
>>>> examples/src/main/python/mllib? The first one is an example of writing
>>>> logistic regression by hand and won't be as efficient as the MLlib one. I
>>>> suggest trying the MLlib one.
>>>>
>>>> You may also want to check how many iterations it runs; by default I
>>>> think it runs 100, which may be more than you need.
>>>>
>>>> Matei
>>>>
>>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>>>>
>>>> > Hi All,
>>>> >
>>>> > I am new to Spark, and I am trying to run LogisticRegression (with SGD)
>>>> > using MLlib on a beefy single machine with about 128GB RAM. The dataset
>>>> > has about 80M rows with only 4 features, so it barely occupies 2GB on disk.
>>>> >
>>>> > I am running the code using all 8 cores with 20G memory using
>>>> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
>>>> >
>>>> > It seems to take about 3.5 hours without caching and over 5 hours with
>>>> > caching.
>>>> >
>>>> > What is the recommended use for Spark on a beefy single machine?
>>>> >
>>>> > Any suggestions will help!
>>>> >
>>>> > Regards,
>>>> > Krishna
>>>> >
>>>> > Code sample:
>>>> > ---------------------------------------------------------------------------------------------------------------------
>>>> > import sys
>>>> > import time
>>>> >
>>>> > from pyspark import SparkContext
>>>> > from pyspark.mllib.classification import LogisticRegressionWithSGD
>>>> > from pyspark.mllib.regression import LabeledPoint
>>>> >
>>>> > sc = SparkContext(appName="LogisticRegressionWithSGD")
>>>> >
>>>> > # Dataset
>>>> > d = sys.argv[1]
>>>> > data = sc.textFile(d)
>>>> >
>>>> > # Load and parse the data
>>>> > # ----------------------------------------------------------------------------------------------------------
>>>> > def parsePoint(line):
>>>> >     values = [float(x) for x in line.split(',')]
>>>> >     return LabeledPoint(values[0], values[1:])
>>>> >
>>>> > parsedData = data.map(parsePoint).cache()
>>>> > results = {}
>>>> >
>>>> > # Spark
>>>> > # ----------------------------------------------------------------------------------------------------------
>>>> > start_time = time.time()
>>>> >
>>>> > # Build the model
>>>> > niters = 10
>>>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
>>>> >
>>>> > # Evaluate the model on training data
>>>> > labelsAndPreds = parsedData.map(
>>>> >     lambda p: (p.label, spark_model.predict(p.features)))
>>>> > trainErr = (labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count()
>>>> >             / float(parsedData.count()))
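For what it's worth, combining Xiangrui's advice above with Krishna's launch
line, the local-mode invocation would presumably become
spark-submit --driver-memory 60g --master local[8] logistic_regression.py,
since --executor-memory is ignored when all the threads run in one JVM.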