Hello,
I used Mahout for Text Classification and now I'm trying with Spark.
I had the same problem training Bayes with (only) 569 documents.
I solved it by using htf = HashingTF(5000) instead of htf = HashingTF() [default
feature space 2^20]. I don't know if it can be considered a long-term
solution (w
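To illustrate why a smaller HashingTF dimension helps, here is a minimal pure-Python sketch of the hashing trick. It is not Spark's implementation: the function name hashing_tf and the use of zlib.crc32 are assumptions made for illustration only, and Spark's HashingTF uses its own hash function. The point is only that every term lands in one of num_features buckets, so the vector width (and the model width per class) is bounded by that number.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {bucket_index: count} term-frequency vector.

    Hypothetical helper for illustration; not part of Spark.
    """
    counts = Counter()
    for token in tokens:
        # crc32 is chosen only because it is deterministic across runs.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


doc = "spark naive bayes text classification with spark".split()

# Same document hashed into 5000 buckets and into the default-sized 2^20 space.
small = hashing_tf(doc, 5000)
large = hashing_tf(doc, 2 ** 20)

print("non-zero entries (5000 buckets):", len(small))
print("non-zero entries (2^20 buckets):", len(large))
```

The trade-off is that fewer buckets means more hash collisions between distinct terms, which is probably why it is unclear whether this is a long-term fix.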
For the vectorizer, what's the output feature dimension and are you
creating sparse vectors or dense vectors? The model on the driver
consists of numClasses * numFeatures doubles. However, the driver
needs more memory in order to receive the task result (of the same
size) from executors. So you nee
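As a rough back-of-the-envelope check of the numClasses * numFeatures estimate, the sketch below computes the dense model size for a few feature dimensions mentioned in this thread. The class count of 20 is an assumption for illustration (the actual number of labels in the dataset may be larger), and model_bytes is a hypothetical helper, not a Spark API.

```python
BYTES_PER_DOUBLE = 8  # a double-precision float is 8 bytes


def model_bytes(num_classes, num_features):
    """Size in bytes of a dense numClasses x numFeatures matrix of doubles."""
    return num_classes * num_features * BYTES_PER_DOUBLE


NUM_CLASSES = 20  # assumed for illustration only

for num_features in (5000, 43839, 2 ** 20):
    size = model_bytes(NUM_CLASSES, num_features)
    print(f"{num_features:>8} features: {size / 2 ** 20:.1f} MiB")
```

Since the driver also has to hold task results of the same size coming back from executors, the real driver footprint would be some multiple of this figure, and it grows linearly with the number of classes.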
Hi,
I was able to get the training running in local mode with default settings.
There was a problem with the document labels, which were quite large (not 20 as
suggested earlier).
I am currently training 175000 documents on a single node with 2GB of
executor memory and 5GB of driver memory successfull
Xiangrui, Thanks for replying.
I am using the subset of newsgroup20 data. I will send you the vectorized
data for analysis shortly.
I have tried running in local mode as well but I get the same OOM exception.
I started with 4GB of data but then moved to a smaller set to verify that
everything was
Your dataset is small. NaiveBayes should work under the default
settings, even in local mode. Could you try local mode first without
changing any Spark settings? Since your dataset is small, could you
save the vectorized data (RDD[LabeledPoint]) and send me a sample? I
want to take a look at the fea
I get the following stacktrace if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7:
List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7
(MapPartitionsRDD[24] at combineByK
Xiangrui,
Yes, the total number of terms is 43839. I have also tried running it using
different values of parallelism ranging from 1/core to 10/core. I also used
multiple configurations like setting spark.storage.memoryFraction and
spark.shuffle.memoryFraction to default values. The point to note
Does the feature size 43839 equal the number of terms? Check the output
dimension of your feature vectorizer and reduce the number of partitions
to match the number of physical cores. I saw you set
spark.storage.memoryFraction to 0.0. Maybe it is better to keep the
default. Also please confirm the driver