Hello

I have ~5 million text documents, each around 10-15KB in size and split
into ~15 columns. I intend to do machine learning, so I need to read all
of the data at once, and potentially update every record on each run.

So far I've just used JSON serialization, or simply cached the RDD to disk.
However, I feel like there must be a better way.
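
To be concrete, here's roughly what I'm doing today (a minimal sketch;
the paths and app name are placeholders, not my real setup):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CurrentApproach {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("docs"))

        // One JSON object per line, one line per document (placeholder path).
        val docs = sc.textFile("hdfs:///data/docs.json")

        // "Cached the RDD to disk": persist for reuse within a job.
        docs.persist(StorageLevel.DISK_ONLY)

        // "JSON serialization": write the records back out as text files.
        docs.saveAsTextFile("hdfs:///data/docs-out")

        sc.stop()
      }
    }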

I have tried HBase, but I had a hard time setting it up and getting it to
work properly. It also felt like a lot of work for my simple requirements. I
want something /simple/.

Any suggestions?