I don't *think* I need more Spark nodes - I'm just using the one for training, on an r4.large instance I spin up and down as needed.
I was hoping to avoid adding any additional computational load to my Event/Prediction/HBase/ES server (all running on a t2.medium), so I'm looking for a way to *not* install HDFS on there as well. S3 seemed like it would be a super convenient way to pass the model files back and forth, but it sounds like it was never implemented as a data source for the UR's model repository. Perhaps that's something I could implement and contribute? I can *kinda* read Scala haha, maybe this would be a fun learning project. Do you think it would be fairly straightforward? (I've put a rough sketch of what I'm imagining at the bottom of this message, below the quoted thread.)

Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | [email protected]

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel <[email protected]> wrote:

> So you need to have more Spark nodes and this is the problem?
>
> If so, set up HBase on pseudo-clustered HDFS so you have a master node
> address even though all storage is on one machine. Then you use that
> version of HDFS to tell Spark where to look for the model. It gives the
> model a URI.
>
> I have never used the raw S3 support. HDFS can also be backed by S3, but
> you use HDFS APIs; it is an HDFS config setting to use S3.
>
> It is a rather unfortunate side effect of PIO, but there are two ways to
> solve this with no extra servers.
>
> Maybe someone else knows how to use S3 natively for the model stub?
>
>
> From: Dave Novelli <[email protected]>
> Date: March 28, 2018 at 12:13:12 PM
> To: Pat Ferrel <[email protected]>
> Cc: [email protected]
> Subject: Re: Unclear problem with using S3 as a storage data source
>
> Well, it looks like the local file system isn't an option in a
> multi-server configuration without manually setting up a process to
> transfer those stub model files.
>
> I trained models on one heavyweight temporary instance, and then when I
> went to deploy from the prediction server instance it failed due to
> missing files. I copied the .pio_store/models directory from the training
> server over to the prediction server and then was able to deploy.
>
> So, in a dual-instance configuration what's the best way to store the
> files? I'm using pseudo-distributed HBase with standard file system
> storage instead of HDFS (my current aim is keeping down cost and
> complexity for a pilot project).
>
> Is S3 back on the table as an option?
>
> On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <[email protected]> wrote:
>
>> Ahhh ok, thanks Pat!
>>
>>
>> Dave Novelli
>> Founder/Principal Consultant, Ultraviolet Analytics
>> www.ultravioletanalytics.com | 919.210.0948 | [email protected]
>>
>> On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> There is no need to put Universal Recommender models in S3; they are
>>> not used and only exist (in stub form) because PIO requires them. The
>>> actual model lives in Elasticsearch and uses special features of ES to
>>> perform the last phase of the algorithm, so it cannot be replaced.
>>>
>>> The stub PIO models have no data and will be tiny. Putting them in HDFS
>>> or the local file system is recommended.
>>>
>>>
>>> From: Dave Novelli <[email protected]>
>>> Reply: [email protected]
>>> Date: March 22, 2018 at 6:17:32 PM
>>> To: [email protected]
>>> Subject: Unclear problem with using S3 as a storage data source
>>>
>>> Hi all,
>>>
>>> I'm using the Universal Recommender template and I'm trying to switch
>>> the model repository's storage data source from the local file system
>>> to S3. I've read the page at
>>> https://predictionio.apache.org/system/anotherdatastore/ to try to
>>> understand the configuration requirements, but when I run pio train it
>>> reports an error and nothing shows up in the S3 bucket:
>>>
>>> [ERROR] [S3Models] Failed to insert a model to
>>> s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d
>>>
>>> I created a new bucket named "pio-model" and granted full public
>>> permissions.
>>>
>>> Seemingly relevant settings from pio-env.sh:
>>>
>>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
>>> ...
>>>
>>> PIO_STORAGE_SOURCES_S3_TYPE=s3
>>> PIO_STORAGE_SOURCES_S3_REGION=us-west-2
>>> PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model
>>>
>>> # I've tried with and without this
>>> #PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com
>>>
>>> # I've tried with and without this
>>> #PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
>>>
>>> Any suggestions where I can start troubleshooting my configuration?
>>>
>>> Thanks,
>>> Dave
>>>
>>
>
>
> --
> Dave Novelli
> Founder/Principal Consultant, Ultraviolet Analytics
> www.ultravioletanalytics.com | 919.210.0948 | [email protected]
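
P.S. Here's the rough sketch I mentioned above, mostly to check whether I understand the shape of the problem. To be clear, this is a guess, not the real PIO storage API: the Model case class and the insert/get/delete method names are assumptions on my part (loosely inferred from the [S3Models] error output and the fact that the stub models are just small byte arrays); only the AWS Java SDK (v1) calls are things I know exist.

// Hypothetical sketch of an S3-backed model store for PIO -- my guess at
// the interface's shape, NOT the real org.apache.predictionio.data.storage API.
import java.io.ByteArrayInputStream

import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import com.amazonaws.services.s3.model.ObjectMetadata
import com.amazonaws.util.IOUtils

// Stand-in for PIO's model record: an id plus the serialized stub-model bytes.
case class Model(id: String, models: Array[Byte])

class S3Models(bucket: String, region: String) {
  private val s3: AmazonS3 =
    AmazonS3ClientBuilder.standard().withRegion(region).build()

  // Upload the stub model bytes under their id.
  def insert(m: Model): Unit = {
    val meta = new ObjectMetadata()
    meta.setContentLength(m.models.length.toLong)
    s3.putObject(bucket, m.id, new ByteArrayInputStream(m.models), meta)
  }

  // Fetch the bytes back by id, if the object exists.
  def get(id: String): Option[Model] =
    if (!s3.doesObjectExist(bucket, id)) None
    else {
      val obj = s3.getObject(bucket, id)
      try Some(Model(id, IOUtils.toByteArray(obj.getObjectContent)))
      finally obj.close()
    }

  def delete(id: String): Unit = s3.deleteObject(bucket, id)
}

Since the models are just tiny stubs, it really does look like put/get/delete on a byte array - which is what makes me think it could be a tractable first contribution.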

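P.P.S. For my own notes, here's how I read the "HDFS config setting to use S3" route Pat mentions above: the hadoop-aws (s3a) connector lets you reach an S3 bucket through the ordinary HDFS FileSystem API. A minimal sketch, assuming hadoop-aws and its AWS SDK dependency are on the classpath - the property keys would normally live in core-site.xml rather than be set in code, and the credential values are placeholders:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object S3ViaHdfs extends App {
  val conf = new Configuration()
  conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY") // placeholder credentials
  conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY") // placeholder credentials

  // Per Pat's note that "you use HDFS APIs": the s3a:// scheme is what
  // routes these FileSystem calls to the bucket instead of a namenode.
  val fs = FileSystem.get(new URI("s3a://pio-model/"), conf)
  fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
}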