I don't *think* I need more Spark nodes - I'm just using the one for training, on an r4.large instance I spin up and down as needed.
I was hoping to avoid adding any additional computational load to my Event/Prediction/HBase/ES server (all running on a t2.medium), so I'm looking for a way to *not* install HDFS on there as well. S3 seemed like it would be a super convenient way to pass the model files back and forth, but it sounds like it was never implemented as a data source for the UR's model repository. Perhaps that's something I could implement and contribute? I can *kinda* read Scala haha, maybe this would be a fun learning project. Do you think it would be fairly straightforward? (I've put a rough sketch of what I'm imagining at the bottom of this message, below the quoted thread.)

Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | [email protected]

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel <[email protected]> wrote:

> So you need to have more Spark nodes and this is the problem?
>
> If so, set up HBase on pseudo-clustered HDFS so you have a master node
> address even though all storage is on one machine. Then you use that
> version of HDFS to tell Spark where to look for the model. It gives the
> model a URI.
>
> I have never used the raw S3 support. HDFS can also be backed by S3, but
> you use HDFS APIs; it is an HDFS config setting to use S3.
>
> It is a rather unfortunate side effect of PIO, but there are two ways to
> solve this with no extra servers.
>
> Maybe someone else knows how to use S3 natively for the model stub?
>
>
> From: Dave Novelli <[email protected]>
> Date: March 28, 2018 at 12:13:12 PM
> To: Pat Ferrel <[email protected]>
> Cc: [email protected]
> Subject: Re: Unclear problem with using S3 as a storage data source
>
> Well, it looks like the local file system isn't an option in a
> multi-server configuration without manually setting up a process to
> transfer those stub model files.
>
> I trained models on one heavyweight temporary instance, and then when I
> went to deploy from the prediction server instance it failed due to
> missing files. I copied the .pio_store/models directory from the training
> server over to the prediction server and then was able to deploy.
>
> So, in a dual-instance configuration what's the best way to store the
> files? I'm using pseudo-distributed HBase with standard file system
> storage instead of HDFS (my current aim is keeping down cost and
> complexity for a pilot project).
>
> Is S3 back on the table as an option?
>
> On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <[email protected]> wrote:
>
>> Ahhh ok, thanks Pat!
>>
>>
>> Dave Novelli
>> Founder/Principal Consultant, Ultraviolet Analytics
>> www.ultravioletanalytics.com | 919.210.0948 | [email protected]
>>
>> On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel <[email protected]> wrote:
>>
>>> There is no need to put Universal Recommender models in S3; they are
>>> not used and only exist (in stub form) because PIO requires them. The
>>> actual model lives in Elasticsearch and uses special features of ES to
>>> perform the last phase of the algorithm, so it cannot be replaced.
>>>
>>> The stub PIO models have no data and will be tiny. Putting them in HDFS
>>> or the local file system is recommended.
>>>
>>>
>>> From: Dave Novelli <[email protected]>
>>> Reply: [email protected]
>>> Date: March 22, 2018 at 6:17:32 PM
>>> To: [email protected]
>>> Subject: Unclear problem with using S3 as a storage data source
>>>
>>> Hi all,
>>>
>>> I'm using the Universal Recommender template and I'm trying to switch
>>> the model repository's storage data source from the local file system
>>> to S3. I've read the page at
>>> https://predictionio.apache.org/system/anotherdatastore/ to try to
>>> understand the configuration requirements, but when I run pio train it
>>> reports an error and nothing shows up in the S3 bucket:
>>>
>>> [ERROR] [S3Models] Failed to insert a model to
>>> s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d
>>>
>>> I created a new bucket named "pio-model" and granted full public
>>> permissions.
>>>
>>> Seemingly relevant settings from pio-env.sh:
>>>
>>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
>>> ...
>>>
>>> PIO_STORAGE_SOURCES_S3_TYPE=s3
>>> PIO_STORAGE_SOURCES_S3_REGION=us-west-2
>>> PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model
>>>
>>> # I've tried with and without this
>>> #PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com
>>>
>>> # I've tried with and without this
>>> #PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
>>>
>>> Any suggestions where I can start troubleshooting my configuration?
>>>
>>> Thanks,
>>> Dave
>>>
>>
>
>
> --
> Dave Novelli
> Founder/Principal Consultant, Ultraviolet Analytics
> www.ultravioletanalytics.com | 919.210.0948 | [email protected]
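
P.S. Here's the rough sketch I mentioned above, mostly to check whether I understand the shape of the problem. To be clear, this is a guess, not the real PIO storage API: the Model case class and the insert/get/delete method names are assumptions on my part (loosely inferred from the [S3Models] error output and the fact that the stub models are just small byte arrays); only the AWS Java SDK (v1) calls are things I know exist.

// Hypothetical sketch of an S3-backed model store for PIO -- my guess at
// the interface's shape, NOT the real org.apache.predictionio.data.storage API.
import java.io.ByteArrayInputStream

import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import com.amazonaws.services.s3.model.ObjectMetadata
import com.amazonaws.util.IOUtils

// Stand-in for PIO's model record: an id plus the serialized stub-model bytes.
case class Model(id: String, models: Array[Byte])

class S3Models(bucket: String, region: String) {
  private val s3: AmazonS3 =
    AmazonS3ClientBuilder.standard().withRegion(region).build()

  // Upload the stub model bytes under their id.
  def insert(m: Model): Unit = {
    val meta = new ObjectMetadata()
    meta.setContentLength(m.models.length.toLong)
    s3.putObject(bucket, m.id, new ByteArrayInputStream(m.models), meta)
  }

  // Fetch the bytes back by id, if the object exists.
  def get(id: String): Option[Model] =
    if (!s3.doesObjectExist(bucket, id)) None
    else {
      val obj = s3.getObject(bucket, id)
      try Some(Model(id, IOUtils.toByteArray(obj.getObjectContent)))
      finally obj.close()
    }

  def delete(id: String): Unit = s3.deleteObject(bucket, id)
}

Since the models are just tiny stubs, it really does look like put/get/delete on a byte array - which is what makes me think it could be a tractable first contribution.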

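P.P.S. For my own notes, here's how I read the "HDFS config setting to use S3" route Pat mentions above: the hadoop-aws (s3a) connector lets you reach an S3 bucket through the ordinary HDFS FileSystem API. A minimal sketch, assuming hadoop-aws and its AWS SDK dependency are on the classpath - the property keys would normally live in core-site.xml rather than be set in code, and the credential values are placeholders:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object S3ViaHdfs extends App {
  val conf = new Configuration()
  conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY") // placeholder credentials
  conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY") // placeholder credentials

  // Per Pat's note that "you use HDFS APIs": the s3a:// scheme is what
  // routes these FileSystem calls to the bucket instead of a namenode.
  val fs = FileSystem.get(new URI("s3a://pio-model/"), conf)
  fs.listStatus(new Path("/")).foreach(status => println(status.getPath))
}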