Sorry, then I don't understand: which part has no access to the file system on the 
single machine? 

Also, a t2 is not going to work with PIO. Spark 2 alone requires something like 
2g for a do-nothing empty executor and driver, so a real app will require 16g 
or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get 
you to over 8g, and then you add data. Spark keeps all data needed at a given phase of 
the calculation in memory across the cluster; that's where it gets its speed. 
Welcome to big data :-)


From: Dave Novelli <[email protected]>
Reply: [email protected] <[email protected]>
Date: March 28, 2018 at 3:47:35 PM
To: Pat Ferrel <[email protected]>
Cc: [email protected] <[email protected]>
Subject:  Re: Unclear problem with using S3 as a storage data source  

I don't *think* I need more spark nodes - I'm just using the one for training 
on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my 
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am looking 
for a way to *not* install HDFS on there as well. S3 seemed like it would be a 
super convenient way to pass the model files back and forth, but it sounds like 
it wasn't implemented as a data source for the model repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda* read 
Scala haha, maybe this would be a fun learning project. Do you think it would 
be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | [email protected]

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel <[email protected]> wrote:
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node address 
even though all storage is on one machine. Then use that HDFS to tell Spark 
where to look for the model; this gives the model a URI.

I have never used the raw S3 support. HDFS can also be backed by S3, but you 
still use the HDFS APIs; pointing HDFS at S3 is just an HDFS config setting.
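Roughly, the S3-backed HDFS setting looks like this (property names from the 
Hadoop s3a connector; I haven't tested this exact snippet, and the bucket name 
and keys are placeholders):

```xml
<!-- core-site.xml: serve HDFS-style URIs from S3 via the s3a connector.
     Bucket name and credentials below are placeholders. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://pio-model</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```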

It is a rather unfortunate side effect of PIO, but there are 2 ways to solve 
this with no extra servers. 

Maybe someone else knows how to use S3 natively for the model stub?
 

From: Dave Novelli <[email protected]>
Date: March 28, 2018 at 12:13:12 PM
To: Pat Ferrel <[email protected]>
Cc: [email protected] <[email protected]>
Subject:  Re: Unclear problem with using S3 as a storage data source

Well, it looks like the local file system isn't an option in a multi-server 
configuration without manually setting up a process to transfer those stub 
model files.

I trained models on one heavy-weight temporary instance, and then when I went 
to deploy from the prediction server instance it failed due to missing files. I 
copied the .pio_store/models directory from the training server over to the 
prediction server and then was able to deploy.
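The copy I did was essentially this (hostnames and the model file name are just 
illustrative; the local commands below simulate the same copy between two 
directories so it can be tried on one box):

```shell
# On the real two-server setup, sync over SSH from the training box, e.g.:
#   rsync -av ~/.pio_store/models/ predict-host:~/.pio_store/models/
# Local simulation: "train" and "predict" are stand-ins for the two instances.
mkdir -p /tmp/train/.pio_store/models /tmp/predict/.pio_store
echo "stub-model" > /tmp/train/.pio_store/models/pio_modelXYZ
cp -r /tmp/train/.pio_store/models /tmp/predict/.pio_store/
cat /tmp/predict/.pio_store/models/pio_modelXYZ
```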

So, in a dual-instance configuration what's the best way to store the files? 
I'm using pseudo-distributed HBase with standard file system storage instead of 
HDFS (my current aim is keeping down cost and complexity for a pilot project).

Is S3 back on the table as an option?

On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <[email protected]> 
wrote:
Ahhh ok, thanks Pat!


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | [email protected]

On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel <[email protected]> wrote:
There is no need to have Universal Recommender models put in S3, they are not 
used and only exist (in stub form) because PIO requires them. The actual model 
lives in Elasticsearch and uses special features of ES to perform the last 
phase of the algorithm and so cannot be replaced.

The stub PIO models have no data and will be tiny. Putting them in HDFS or the 
local file system is recommended.


From: Dave Novelli <[email protected]>
Reply: [email protected] <[email protected]>
Date: March 22, 2018 at 6:17:32 PM
To: [email protected] <[email protected]>
Subject:  Unclear problem with using S3 as a storage data source

Hi all,

I'm using the Universal Recommender template and I'm trying to switch the model 
repository's storage data source from local file to S3. I've read the page 
at https://predictionio.apache.org/system/anotherdatastore/ to try to 
understand the configuration requirements, but when I run pio train it 
reports an error and nothing shows up in the S3 bucket: 

[ERROR] [S3Models] Failed to insert a model to 
s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d

I created a new bucket named "pio-model" and granted full public permissions.

Seemingly relevant settings from pio-env.sh:

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
...

PIO_STORAGE_SOURCES_S3_TYPE=s3
PIO_STORAGE_SOURCES_S3_REGION=us-west-2
PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com

# I've tried with and without this
#PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
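For reference, the credentials the PIO process uses should at minimum allow the 
actions below; this is only a sketch of the IAM policy shape (standard AWS S3 
actions, bucket name matching my setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::pio-model",
        "arn:aws:s3:::pio-model/*"
      ]
    }
  ]
}
```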


Any suggestions where I can start troubleshooting my configuration?

Thanks,
Dave




--
Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | [email protected]
