Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to the Attic

2020-08-31 Thread Pat Ferrel
To try to keep this on-subject I’ll say that I’ve been working on what I once 
saw as a next-gen PIO. It is ASL 2, and has 2 engines that ran in PIO — most 
notably the Universal Recommender. We offered to make the Harness project part 
of PIO a couple years back but didn’t get much interest. It is now in 
v0.6.0-SNAPSHOT. The key difference is that it is designed for the user, rather
than the Data Scientist.

Check Harness out: https://github.com/actionml/harness Contributors are 
welcome. 

We owe everything to PIO where we proved it could be done.



From: Donald Szeto 
Reply: user@predictionio.apache.org 
Date: August 29, 2020 at 3:45:04 PM
To: d...@predictionio.apache.org 
Cc: user@predictionio.apache.org 
Subject:  Re: [DISCUSS] Dissolve Apache PredictionIO PMC and move project to 
the Attic  

It looks like there is no objection. I will start a vote shortly.

Regards,
Donald

On Mon, Aug 24, 2020 at 1:17 PM Donald Szeto  wrote:
Hi all,

The Apache PredictionIO project had an amazing ride in its early years.
Unfortunately, its momentum has declined, and its core technology has fallen
behind. Although we have received some appeals from the community to help bring
the project up to speed, the effort has not been sufficient.

I think it is about time to archive the project. The proper way to do so is to 
follow the Apache Attic process documented at 
http://attic.apache.org/process.html. This discussion thread is the first step. 
If there is no objection, it will be followed by a voting thread.

Existing users: This move should not impact existing functionality, as the 
source code will still be available through the Apache Attic, in a read-only 
state.

Thank you for your continued support over the years. The project would not be 
possible without your help.

Regards,
Donald

Re: PredictionIO ASF Board Report for Mar 2020

2020-03-19 Thread Pat Ferrel
PredictionIO is scalable BY SCALING ITS SUB-SERVICES. Running on a single
machine sounds like no scaling has been executed or even planned.

How do you scale ANY system?
1) Vertical scaling: make the instance larger with more cores, more disk,
and, most importantly, more memory. Increase whatever resource you need most,
but all will be affected eventually.
2) Move each service to its own instance: move the DB, Spark, etc.
(depending on what you are using). Then you can scale the sub-services (the
ones PIO uses) independently as needed.
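Step 2 boils down to editing pio-env.sh so each backend points at its own host
instead of localhost. A minimal sketch, assuming a PostgreSQL, Elasticsearch,
and HBase setup; the hostnames are placeholders and the exact keys depend on
which storage sources your pio-env.sh enables:

```shell
# pio-env.sh fragment: point PIO's sub-services at dedicated instances.
# Hostnames (db-host, es-host) are placeholders for your own machines.
SPARK_HOME=/opt/spark
PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://db-host:5432/pio
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=es-host
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
PIO_STORAGE_SOURCES_HBASE_HOME=/opt/hbase   # client config for a remote HBase cluster
```

Once split out, each sub-service can be given more memory, disk, or replicas
without touching the machine that runs pio itself.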

Without a scaling plan you must trim your data to fit the system you have;
for instance, save only a few months of data. Unfortunately PIO has no
automatic way to do this, like a TTL. We created a template that you can
run to trim your DB by dropping old data. Unfortunately we have not kept up
with PIO versions, since we have moved to another ML server that DOES have
TTLs.

If anyone wants to upgrade the template it was last used with PIO 0.12.x
and is here: https://github.com/actionml/db-cleaner
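If the db-cleaner template doesn't match your PIO version, a cruder route is to
round-trip the data through `pio export` and `pio import`, dropping old events
along the way. The filtering step can be a few lines of Python; this is a
sketch, assuming PIO's standard export format of one JSON event per line with
an ISO-8601 `eventTime` (the cutoff date and sample events are placeholders):

```python
import json

def keep_recent(lines, cutoff="2020-01-01"):
    """Keep exported PIO events whose eventTime is on or after cutoff.

    ISO-8601 timestamps compare correctly as strings, so no date
    parsing is needed.
    """
    kept = []
    for line in lines:
        event = json.loads(line)
        if event.get("eventTime", "") >= cutoff:
            kept.append(line)
    return kept

# Two fake exported events standing in for `pio export` output:
sample = [
    '{"event":"buy","eventTime":"2019-06-01T00:00:00.000Z"}',
    '{"event":"buy","eventTime":"2020-03-01T00:00:00.000Z"}',
]
print(len(keep_recent(sample)))  # 1
```

Run `pio export` for the app, filter the part-* files with something like
this, then `pio import` the survivors into a clean app.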

If you continually add data to a bucket it will eventually overflow, how
could it be any other way?



From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: March 19, 2020 at 7:43:08 AM
To: user@predictionio.apache.org 

Subject:  Re: PredictionIO ASF Board Report for Mar 2020

Hello!

My knowledge of PredictionIO is limited. I was able to set up a
PredictionIO server and run two templates on it, the Recommendation and
Similar Item templates. The server is in production at my company and we
were getting good results. Suddenly, as we fed data to the server, our
cloud machine's disk filled up, so we can't add new data anymore, nor can
we process the existing data. The error message on Ubuntu states: "No space
left on device".

I am deploying this server on a single machine without any cluster or the
help of Docker. Do you have any suggestions to solve this issue? Also, is
there a way to clean old data off the machine?

As a final note, my knowledge in the data engineering and machine learning
fields is limited. I understand Scala and can work with Spark. However, I am
willing to dig deeper into PredictionIO. Do you think there is a way I can
contribute to the community in one way or another? Or are you just looking
for true experts in order to avoid moving the project to the Attic?

Regards
Sami Serbey
--
*From:* Donald Szeto 
*Sent:* Tuesday, March 10, 2020 8:26 PM
*To:* user@predictionio.apache.org ;
d...@predictionio.apache.org 
*Subject:* PredictionIO ASF Board Report for Mar 2020

Hi all,

Please take a look at the draft report below and make your comments or
edits as you see fit. The draft will be submitted on Mar 11, 2020.

Regards,
Donald

## Description:
The mission of Apache PredictionIO is the creation and maintenance of
software related to a machine learning server, built on top of a
state-of-the-art open source stack, that enables developers to manage and
deploy production-ready predictive services for various kinds of machine
learning tasks.

## Issues:
Update: A community member, who is a committer and PMC member of another
Apache project, has expressed interest in helping. The member has been
engaged and we are waiting for actions from that member.

Last report: No PMC chair candidate was nominated in the week after the PMC
chair expressed the intention to resign on the PMC mailing list.

## Membership Data:
Apache PredictionIO was founded 2017-10-17 (2 years ago)
There are currently 29 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 8:7.

Community changes, past quarter:
- No new PMC members. Last addition was Andrew Kyle Purtell on 2017-10-17.
- No new committers were added.

## Project Activity:
Only sparse activity on the mailing lists.

Recent releases:

0.14.0 was released on 2019-03-11.
0.13.0 was released on 2018-09-20.
0.12.1 was released on 2018-03-11.

## Community Health:
Update: A community member, who is a committer and PMC member of another
Apache project, has expressed interest in helping. The member has been
engaged and we are waiting for actions from that member to see if a
nomination to PMC and chair would be appropriate.

Last report: We are currently seeking new leadership for the project to
bring it out of maintenance mode. Moving to the Attic would be the last
option.


Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
Oops, should have said: "I may have missed something but I don’t recall PIO
being released by Apache as an ASF maintained container/image release
artifact."


From: Pat Ferrel  
Reply: user@predictionio.apache.org 

Date: July 3, 2019 at 11:16:43 AM
To: Wei Chen  ,
d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Re: JAVA_HOME is not set

BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps to your Dockerfile:
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/
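For reference, the linked steps amount to a Dockerfile fragment like the one
below. This is a sketch assuming an Ubuntu 18.04 base image with OpenJDK 8;
the package name and JVM path differ on other base images:

```dockerfile
# Install a JDK and make JAVA_HOME visible to pio inside the container
RUN apt-get update && apt-get install -y openjdk-8-jdk
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PATH="$JAVA_HOME/bin:$PATH"
```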

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: JAVA_HOME is not set

2019-07-03 Thread Pat Ferrel
BTW the container you use is supported by the container author, if at all.

I may have missed something but I don’t recall PIO being released by Apache
as an ASF maintained release artifact.

I wish ASF projects would publish Docker Images made for real system
integration, but IIRC PIO does not.


From: Wei Chen  
Reply: d...@predictionio.apache.org 

Date: July 2, 2019 at 5:14:38 PM
To: user@predictionio.apache.org 

Cc: d...@predictionio.apache.org 

Subject:  Re: JAVA_HOME is not set

Add these steps to your Dockerfile:
https://vitux.com/how-to-setup-java_home-path-in-ubuntu/

Best Regards
Wei

On Wed, Jul 3, 2019 at 5:06 AM Alexey Grachev <
alexey.grac...@turbinekreuzberg.com> wrote:

> Hello,
>
>
> I installed newest pio 0.14 with docker. (ubuntu 18.04)
>
> After starting pio -> I get "JAVA_HOME is not set"
>
>
> Does anyone know where in docker config I have to setup the JAVA_HOME
> env variable?
>
>
> Thanks a lot!
>
>
> Alexey
>
>
>


Re: new install help

2019-04-15 Thread Pat Ferrel
Most people running on a Windows machine use a VM running Linux. You will
run into constant issues if you go down another road with something like
Cygwin, so avoid the headache.


From: Steve Pruitt  
Reply: user@predictionio.apache.org 

Date: April 15, 2019 at 10:59:09 AM
To: user@predictionio.apache.org 

Subject:  new install help

I installed on a Windows 10 box.  A couple of questions and then a problem
I have.



I downloaded the binary distribution.

I already had Spark installed, so I changed pio-env.sh to point to my Spark.

I downloaded and installed Postgres.  I downloaded the jdbc driver and put
it in the PredictionIO-0.14.0\lib folder.



My questions are:

Reading the PIO install directions I cannot tell if Elasticsearch and HBase
are optional.  The pio-env.sh file has references to them commented out, and
the PIO install page mentions skipping them if not using them.  So I didn't
install them.



When I tried executing PredictionIO-0.14.0\bin\pio eventserver & command
from the command line, I got this error

'PredictionIO-0.14.0\bin\pio' is not recognized as an internal or external
command, operable program or batch file.



Oops.  I think my assumption that PIO runs on Windows is bad.  I want to
confirm it's not something I overlooked.



-S


Re: universal recommender version

2018-11-27 Thread Pat Ferrel
There is a tag v0.7.3 and yes it is in master:

https://github.com/actionml/universal-recommender/tree/v0.7.3


From: Marco Goldin 
Reply: user@predictionio.apache.org 
Date: November 20, 2018 at 6:56:39 AM
To: user@predictionio.apache.org , 
gyar...@griddynamics.com 
Subject:  Re: universal recommender version  

Hi George, the most recent stable release is 0.7.3, which is simply on the 
master branch; that's why you don't see a 0.7.3 tag.
Download master via Git and you'll be fine.
If you check the build.sbt in master you'll see specs as:

version := "0.7.3"
scalaVersion := "2.11.11"

that's the one you're looking for. 

Il giorno mar 20 nov 2018 alle ore 15:47 George Yarish 
 ha scritto:
Hi,

Can someone please advise what the most recent release version of the 
Universal Recommender is, and where its source code is located?

According to the branches of the GitHub project 
https://github.com/actionml/universal-recommender it is v0.8.0 (though this 
branch looks a bit outdated), but according to the documentation 
https://actionml.com/docs/ur_version_log it is 0.7.3, which can't be found in 
the GitHub repo. 

Thanks,
George

Re: PIO train issue

2018-08-29 Thread Pat Ferrel
Assuming you are using the UR…

I don’t know how many times this has been caused by a misspelling of
eventNames in engine.json but assume you have checked that.

The fail-safe way to check is to `pio export` your data and check it
against your engine.json.

BTW `pio status` does not even try to check all services. Run `pio app
list` to see if the right appNames (dataset names) are in the EventServer,
which will check HBase, HDFS, and Elasticsearch. Then check that you have
Spark, Elasticsearch, and HDFS running—if you have set them to run in remote
standalone mode.
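One way to automate the `pio export` check against engine.json mentioned
above: tally the event names in the export dump and compare them to the
`eventNames` array. The sample events and names below are made up; only the
export format (one JSON event per line with an "event" field) is assumed:

```python
import json
from collections import Counter

def check_event_names(export_lines, engine_event_names):
    """Count events in exported data and report any engine.json
    eventNames that never occur (a common cause of empty training sets)."""
    counts = Counter(json.loads(line)["event"] for line in export_lines)
    missing = [name for name in engine_event_names if counts[name] == 0]
    return counts, missing

# Fake exported events standing in for `pio export` output:
sample = [
    '{"event":"purchase","entityId":"u1","targetEntityId":"i1"}',
    '{"event":"view","entityId":"u1","targetEntityId":"i2"}',
]
counts, missing = check_event_names(sample, ["purchase", "view", "viewed"])
print(missing)  # ['viewed'] -> a likely misspelling in engine.json
```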


From: bala vivek  
Date: August 29, 2018 at 8:43:05 AM
To: actionml-user 
, user@predictionio.apache.org
 
Subject:  PIO train issue

Hi PIO users,

I've been using PIO 0.10 for a long time. I recently moved a working PIO
setup from Ubuntu to CentOS, and it seems to work fine: when I check
`pio status`, it shows all the services are up and working.
But when I run `pio train` I see a "Data set is empty" error. I have
cross-checked by manually scanning the HBase tables, and the records are
present inside the event table. To verify further, I did a curl with the
access key for the particular app, and the response is HTTP 200 OK, so it's
confirmed that the app has the data.
But when I run `pio train` it does not train the model. The engine file has
no issues, as the appName is also given correctly. It always shows "Data set
is empty". This same setup works fine on Ubuntu 14. I haven't made any
config changes to make it run on CentOS.

Let me know what could be the reason for this issue, as the data is present
in HBase but the PIO engine fails to detect it.

Thanks
Bala
--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CABdDaRqqpGcPb%3DZD-ms6i5OzY8_JdLQ3YbbcapS_dS8TxkGidQ%40mail.gmail.com

.
For more options, visit https://groups.google.com/d/optout.


Re: Distinct recommendation from "random" backfill?

2018-08-28 Thread Pat Ferrel
The random ranking is assigned after every `pio train`, so if you have not
trained in between, the results will be the same. Random is not really meant
to do what you are using it for; it is meant to surface items with no
data—no primary events. This allows some of them to get real events and be
recommended based on those events the next time you train. It is meant to
fill in when you ask for 20 recs but there are only 10 things that can be
recommended. Proper use of this with frequent training will cause items with
no data to be purchased and therefore to get data. The reason rankings are
assigned at train time is that this is the only way to get all of the
business rules applied to the query as well as a random ranking. In other
words, the ranking must be built into the model with `pio train`.

If you want to recommend random items each time you query, create a list of
item ids from your catalog and return some random sample each query
yourself. This should be nearly trivial.
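That do-it-yourself random fallback is small indeed; a sketch, where the
catalog contents and the exclusion of already-seen items are placeholders
you would fill from your own data:

```python
import random

def random_recs(catalog_ids, num=20, exclude=()):
    """Return a fresh random sample of item ids per query, so two
    data-less items never share the same canned recommendation list."""
    pool = [i for i in catalog_ids if i not in set(exclude)]
    return random.sample(pool, min(num, len(pool)))

catalog = [f"item-{n}" for n in range(1000)]
print(len(random_recs(catalog, num=15)))  # 15
```

Call something like this when the UR returns fewer results than requested,
or when the queried item has no events.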


From: Brian Chiu  
Reply: user@predictionio.apache.org 

Date: August 28, 2018 at 1:51:24 AM
To: u...@predictionio.incubator.apache.org


Subject:  Distinct recommendation from "random" backfill?

Dear pio developers and users:

I have been using PredictionIO and the Universal Recommender for a while.
In the Universal Recommender engine.json, there is a configuration field
`rankings`, and one of the options is random. Initially I thought it would
give each item without any related events some random recommended items,
and that each recommendation list would be different. However, it turns out
all of the random recommended item lists are the same. For example, if both
item "6825991" and item "682599" have no events during training, the result
will be

```
$ curl -H "Content-Type: application/json" -d '{ "item": "6825991" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

$ curl -H "Content-Type: application/json" -d '{ "item": "682599" }'
http://localhost:8000/queries.json
{"itemScores":[{"item":"8083748","score":0.0},{"item":"7942100","score":0.0},{"item":"8016271","score":0.0},{"item":"7731061","score":0.0},{"item":"8002458","score":0.0},{"item":"7763317","score":0.0},{"item":"8141119","score":0.0},{"item":"8080694","score":0.0},{"item":"7994844","score":0.0},{"item":"7951667","score":0.0},{"item":"7948453","score":0.0},{"item":"8148479","score":0.0},{"item":"8113083","score":0.0},{"item":"8041124","score":0.0},{"item":"8004823","score":0.0},{"item":"8126058","score":0.0},{"item":"8093042","score":0.0},{"item":"8064036","score":0.0},{"item":"8022524","score":0.0},{"item":"7977131","score":0.0}]}

```

But on my webpage, whenever users click on these products without events,
they see exactly the same recommended items, which makes it look boring. Is
there any way to give each item a distinct random list? Even one generated
dynamically is OK. If you have any other alternative, please also tell me.

Thanks all developers!

Best Regards,
Brian


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
Oh and no it does not need a new context for every query, only for the
deploy.


From: Pat Ferrel  
Date: August 7, 2018 at 10:00:49 AM
To: Ulavapalle Meghamala 

Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale the load. I don’t think any templates do this—it requires a
load balancer at the very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in `pio
deploy`? Not just creating a Spark context to launch the `pio deploy`
driver and then dropping it, but a Spark context that runs throughout the
PredictionServer life cycle? Or how does PredictionIO handle this case?
Does it create a new Spark context every time a prediction has to be made?

Also, in production deployments (where Spark is not really used), how do
you scale the PredictionServer? Do you just deploy the same model on
multiple machines and have an LB/HA proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don’t read them. Each template defines its own requirements. To
> use the UR stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through the documentation here:
> http://actionml.com/docs/single_driver_machine. But it only talks about
> "pio train". Please guide me to any documentation that is available on
> "pio deploy" with Spark.
>
> Thanks,
> Megha
>
>


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
The answers to your question illustrate why IMHO it is bad to have Spark
required for predictions.

Any of the MLlib ALS recommenders use Spark to predict and so run Spark
during the time they are deployed. They can use one machine or use the
entire cluster. This is one case where using the cluster slows down
predictions since part of the model may be spread across nodes. Spark is
not designed to scale in this manner for real-time queries but I believe
those are your options out of the box for the ALS recommenders.

To be both fast and scalable you would load the model entirely into memory
on one machine for fast queries then spread queries across many identical
machines to scale the load. I don’t think any templates do this—it requires a
load balancer at the very least, not to mention custom deployment code that
interferes with using the same machines for training.

The UR loads the model into Elasticsearch for serving independently
scalable queries.

I always advise you keep Spark out of serving for the reasons mentioned
above.


From: Ulavapalle Meghamala 

Date: August 7, 2018 at 9:27:46 AM
To: Pat Ferrel  
Cc: user@predictionio.apache.org 
, actionml-user
 
Subject:  Re: PredictionIO spark deployment in Production

Thanks Pat for getting back.

Are there any PredictionIO models/templates which really use Spark in `pio
deploy`? Not just creating a Spark context to launch the `pio deploy`
driver and then dropping it, but a Spark context that runs throughout the
PredictionServer life cycle? Or how does PredictionIO handle this case?
Does it create a new Spark context every time a prediction has to be made?

Also, in production deployments (where Spark is not really used), how do
you scale the PredictionServer? Do you just deploy the same model on
multiple machines and have an LB/HA proxy to handle requests?

Thanks,
Megha



On Tue, Aug 7, 2018 at 9:35 PM, Pat Ferrel  wrote:

> PIO is designed to use Spark in train and deploy. But the Universal
> Recommender removes the need for Spark to make predictions. This IMO is a
> key to use Spark well—remove it from serving results. PIO creates a Spark
> context to launch the `pio deploy' driver but Spark is never used and the
> context is dropped.
>
> The UR also does not need to be re-deployed after each train. It hot swaps
> the new model into use outside of Spark and so if you never shut down the
>  PredictionServer you never need to re-deploy.
>
> The confusion comes from reading Apache PIO docs which may not do things
> this way—don’t read them. Each template defines its own requirements. To
> use the UR stick with its documentation.
>
> That means Spark is used to “train” only and you never re-deploy. Deploy
> once—train periodically.
>
>
> From: Ulavapalle Meghamala 
> 
> Reply: user@predictionio.apache.org 
> 
> Date: August 7, 2018 at 4:13:39 AM
> To: user@predictionio.apache.org 
> 
> Subject:  PredictionIO spark deployment in Production
>
> Hi,
>
> Are there any templates in PredictionIO where "spark" is used even in "pio
> deploy" ? How are you handling such cases ? Will you create a spark context
> every time you run a prediction ?
>
> I have gone through the documentation here:
> http://actionml.com/docs/single_driver_machine. But it only talks about
> "pio train". Please guide me to any documentation that is available on
> "pio deploy" with Spark.
>
> Thanks,
> Megha
>
>


Re: PredictionIO spark deployment in Production

2018-08-07 Thread Pat Ferrel
PIO is designed to use Spark in train and deploy. But the Universal
Recommender removes the need for Spark to make predictions. This IMO is a
key to use Spark well—remove it from serving results. PIO creates a Spark
context to launch the `pio deploy' driver but Spark is never used and the
context is dropped.

The UR also does not need to be re-deployed after each train. It hot swaps
the new model into use outside of Spark and so if you never shut down the
 PredictionServer you never need to re-deploy.

The confusion comes from reading Apache PIO docs which may not do things
this way—don’t read them. Each template defines its own requirements. To
use the UR stick with its documentation.

That means Spark is used to “train” only and you never re-deploy. Deploy
once—train periodically.
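"Deploy once—train periodically" usually ends up as a cron entry. A sketch
with placeholder paths and schedule; since the UR hot-swaps the model after
each train, the running PredictionServer is untouched:

```shell
# crontab fragment: retrain nightly at 02:00; no redeploy needed afterward
0 2 * * * cd /opt/ur-engine && /opt/PredictionIO/bin/pio train >> /var/log/pio-train.log 2>&1
```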


From: Ulavapalle Meghamala 

Reply: user@predictionio.apache.org 

Date: August 7, 2018 at 4:13:39 AM
To: user@predictionio.apache.org 

Subject:  PredictionIO spark deployment in Production

Hi,

Are there any templates in PredictionIO where Spark is used even in `pio
deploy`? How are you handling such cases? Will you create a Spark context
every time you run a prediction?

I have gone through the documentation here:
http://actionml.com/docs/single_driver_machine. But it only talks about
`pio train`. Please guide me to any documentation that is available on
`pio deploy` with Spark.

Thanks,
Megha


Re: Straw poll: deprecating Scala 2.10 and Spark 1.x support

2018-08-02 Thread Pat Ferrel
+1


From: takako shimamoto  
Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 2:55:49 AM
To: d...@predictionio.apache.org 
, user@predictionio.apache.org
 
Subject:  Straw poll: deprecating Scala 2.10 and Spark 1.x support

Hi all,

We're considering deprecating Scala 2.10 and Spark 1.x as of
the next release. Our intent is that using deprecated versions
can generate warnings, but that it should still work.

Nothing is concrete about actual removal of support at the moment, but
moving forward, use of Scala 2.11 and Spark 2.x will be recommended.
I think it's time to plan to deprecate 2.10 support, especially
with 2.12 coming soon.

This has an impact on some users, so if you see any issues with this,
please let us know as soon as possible.

Regards,
Takako


Re: 2 pio servers with 1 event server

2018-08-02 Thread Pat Ferrel
What template?


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: August 2, 2018 at 9:08:05 AM
To: user@predictionio.apache.org 

Subject:  2 pio servers with 1 event server

Greetings,

I am trying to run 2 pio servers on different ports, where each server has
its own app. When I deploy the first server, I get the results I want for
predictions on that server. However, after deploying the second server on a
different port, the results from the first server changed. Any idea how I
can fix that?

Or is there some kind of procedure I should follow to be able to run 2
prediction servers from 2 different apps that share the same template?

Regards,
Sami serbey


Re: [actionml/universal-recommender] Boosting categories only shows one category type (#55)

2018-07-06 Thread Pat Ferrel
Please read the docs. There is no need to $set users since they are
attached to usage events and can be detected automatically. In fact,
"$set"ting them is ignored. There are no properties of users that are not
calculated from named "indicators", which can be profile-type things.

For this application I’d ask myself: what do you want the user to do? Do you
want them to view a house listing or schedule a visit? Clearly you want
them to rent, but there will be only one rent per user, so it is not very
strong for indicating taste.

If you have something like 10 visits per user on average, you may have
enough to use visits as the primary indicator, since visits are closer to
“rent”, intuitively. Page views, which may be 10x-100x more frequent than
visits, are your last resort. But if page views are the best “primary”
indicator you have, still use visits and rents as secondary. Users have
many motivations for looking at listings: they may look only at
higher-priced units than they have any intent of renting, or compare
something they would not rent to what they would. Therefore page views are
somewhat removed from the pure user intent of every “rent”, but they may be
the best indicator you have.

Also consider using things like search terms as secondary indicators.

Then send the primary and all secondary events with whatever ids correspond
to the event type. User profile data is harder to use and not as useful as
people think, but it is still encoded as an indicator, just with a
different “eventName”. Something like “location” could be used and might
have an id like a postal code—something that is large enough to include
other users but small enough to still be exclusive.

The above will give you several “usage events” with one primary.
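As a sketch of what those events look like on the wire (all ids and event
names here are made up; the JSON field names follow the PIO event API used
in the examples later in this message):

```python
def indicator_event(event, user_id, target_id):
    """Build a PIO-style usage event: (user, eventName, target id)."""
    return {
        "event": event,
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": target_id,
    }

events = [
    indicator_event("rent", "u1", "listing-42"),    # primary indicator
    indicator_event("visit", "u1", "listing-42"),   # secondary: visits
    indicator_event("search", "u1", "2-bedroom"),   # secondary: search terms
    indicator_event("location", "u1", "98101"),     # profile data as indicator;
                                                    # the "item" id is a postal code
]
print([e["event"] for e in events])  # ['rent', 'visit', 'search', 'location']
```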

Business rules—which are used to restrict results—require you to $set
properties for every rental. So anything in the `fields` part of a query
must correspond to a possible property of items. Those look OK below.

Please use the Google group for questions. GitHub is for bug reports.


From: Amit Assaraf  
Reply: actionml/universal-recommender


Date: July 6, 2018 at 10:11:10 AM
To: actionml/universal-recommender


Cc: Subscribed 

Subject:  [actionml/universal-recommender] Boosting categories only shows
one category type (#55)

I have an app that uses the Universal Recommender. The app is for finding a
house to rent.
I want to recommend houses to users based on houses they have already
viewed or scheduled a tour on.

I added all the users using the $set event.
I added all 96,676 houses in the app like so:

predictionio_client.create_event(
    event="$set",
    entity_type="item",
    entity_id=listing.meta.id,
    properties={
        # There are many property_types, such as "apartment"
        "property_type": ["villa"]
    }
)

And I add the house view & schedule events like so:

predictionio_client.create_event(
    event="view",
    entity_type="user",
    entity_id=request.user.username,
    target_entity_type="item",
    target_entity_id=listing.meta.id
)

Now I want to get predictions for my users based on the property_types they
like.
So I send a prediction query boosting the property_types they like using
Business Rules like so:

{
    'fields': [
        {
            'bias': 1.05,
            'values': ['single_family_home', 'private_house', 'villa', 'cottage'],
            'name': 'property_type'
        }
    ],
    'num': 15,
    'user': 'amit70'
}

I would then expect recommendations of different types, such as
private_house or villa or cottage. But for some weird reason, while having
over 95,000 houses of different property types, I only get recommendations
of *ONE* single type (in this case villa), and if I remove it from the list
it just recommends 10 houses of ONE other type.
This is the response of the query:

{
"itemScores": [
{
"item": "56.39233,-4.11707|villa|0",
"score": 9.42542
},
{
"item": "52.3288,1.68312|villa|0",
"score": 9.42542
},
{
"item": "55.898878,-4.617019|villa|0",
"score": 8.531346
},
{
"item": "55.90713,-3.27626|villa|0",
"score": 8.531346
},
.

I can't understand why this is happening. The Elasticsearch query this
translates to is this:
GET /recommender/_search

{
  "from": 0,
  "size": 15,
  "query": {
"bool": {
  "should": [
{
  "terms": {
"schedule": [
  "32.1439352176,34.833260278|private_house|0",
  "31.7848439,35.2047335|apartment_for_sale|0"
]
  }
},
{
  "terms": {
"view": [
  "32.0734919,34.7722675|garden_apartment|0",
  "32.1375986782,34.8415740159|apartment|0",
  

Re: Digging into UR algorithm

2018-07-02 Thread Pat Ferrel
The CCO algorithm tests for correlation with a statistic called the Log
Likelihood Ratio (LLR). This compares the relative frequencies of 4 different
counts: 2 having to do with the entire dataset and 2 having to do with the 2
events being compared for correlation. Popularity is normalized out of this
comparison but does play a small indirect part in having enough data to
make better guesses about correlation.
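For reference, that 2x2 test can be sketched as below. This follows the standard log-likelihood formulation over a contingency table of user counts (the style used by Apache Mahout's log-likelihood utility, which the UR's CCO implementation builds on); it is a minimal illustration, not the UR's actual code:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Entropy-style term used in the standard 2x2 LLR formulation
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log Likelihood Ratio over a 2x2 contingency table of user counts.

    k11: users who performed both the primary and the secondary event
    k12: users who performed the primary event only
    k21: users who performed the secondary event only
    k22: users who performed neither (the rest of the dataset)
    """
    row = entropy(k11 + k12, k21 + k22)  # uses totals from the whole dataset
    col = entropy(k11 + k21, k12 + k22)  # uses totals from the whole dataset
    mat = entropy(k11, k12, k21, k22)    # uses the pair of events directly
    return max(0.0, 2.0 * (row + col - mat))

# Independent events score ~0; strongly co-occurring events score high,
# regardless of how popular either event is on its own.
print(llr(10, 10, 10, 10))  # ~0.0
print(llr(10, 0, 0, 10))    # ~27.7
```

Note how the row and column terms involve whole-dataset totals while the matrix term involves the event pair, which is how popularity gets normalized out.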

Also remember that the secondary event may have item-ids that are not part
of the primary event. For instance, if you have good search data, then one
(of several) secondary events might be (user-id, "searched-for",
search-term). This has proven to be quite useful as a secondary event in at
least one dataset I've seen.
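As a purely hypothetical illustration of such an event, the payload might look roughly like this; the event name, entity types, and helper function are made up for the example and are not a fixed UR schema:

```python
# Hypothetical payload for a search-based secondary indicator.
# "searched-for" and "search-term" here are illustrative names; the
# search term plays the role of the "item id" for this indicator.
def search_event(user_id, search_term):
    return {
        "event": "searched-for",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "search-term",
        "targetEntityId": search_term,
    }

print(search_event("u1", "red shoes"))
```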


From: Pat Ferrel  
Reply: Pat Ferrel  
Date: July 2, 2018 at 12:18:16 PM
To: user@predictionio.apache.org 
, Sami Serbey 

Cc: actionml-user 

Subject:  Re: Digging into UR algorithm

The only requirement is that someone performed the primary event on A and
the secondary event is correlated to that primary event.

The UR can recommend to a user who has only performed the secondary event
on B, as long as that is in the model. It makes no difference what subset of
events the user has performed; recommendations will work even if the user
has no primary events.

So think of the model as being separate from the user history of events.
Recs are made from user history—whatever it is, but the model must have
some correlated data for each event type you want to use from a user’s
history and sometimes on infrequently seen items there is no model data for
some event types.

Popularity has very little to do with recommendations except for the fact
that popular items are more likely to have good correlated events. In fact we
do things to normalize/down-weight highly popular things because otherwise
recommendations are worse. You can verify this by doing cross-validation
tests of popularity-based vs. collaborative-filtering results using the CCO
algorithm behind the UR.

If you want popular items you can make a query with no user-id and you will
get the most popular. Also if there are not enough recommendations for a
user’s history data we fill in with popular.

Your questions don’t quite match how the algorithm works so hopefully this
straightens out some things.

BTW community support for the UR is here:
https://groups.google.com/forum/#!forum/actionml-user


From: Sami Serbey 

Reply: user@predictionio.apache.org 

Date: July 2, 2018 at 9:32:01 AM
To: user@predictionio.apache.org 

Subject:  Digging into UR algorithm

Hi guys,

So I've been playing around with the UR algorithm and I would like to know
2 things if it is possible:

1- Does UR recommend items that are linked to the primary event only? For
example, if item A is purchased (primary event) 1 time and item B is liked
(secondary event) 50 times, does UR only recommend item A as the popular one
even though item B has 50x the secondary events? Is there a way to work
around this?

2- When I first read about UR I thought that it recommends items based on
the frequency of secondary events relative to primary events. I.e., if 50
likes (secondary event) of item A lead to the purchase of item B and 1 view
(secondary event) of item A leads to the purchase of item C, when someone
views and likes item A he will get recommended items B and C with equal
score, disregarding the 50 likes vs. 1 view. Is that the correct behavior or
am I missing something? Do all secondary events have the same weight of
influence for the recommender?

I hope that you can help me out understanding UR template.

Regards,
Sami Serbey



--
You received this message because you are subscribed to the Google Groups
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com
<https://groups.google.com/d/msgid/actionml-user/CAOtZQD8CU5fVvZ9C32Cj6YaC1F%2B7oxWF%2Br21ApKnuajOZOFuoA%40mail.gmail.com?utm_medium=email_source=footer>
.
For more options, visit https://groups.google.com/d/optout.


Re: a question about a high availability of Elasticsearch cluster

2018-06-22 Thread Pat Ferrel
This should work with any node down. Elasticsearch should elect a new
master.

What version of PIO are you using? PIO and the UR changed the client from
the transport client to the REST client in 0.12.0, which is why you are
using port 9200.

Do all PIO functions work correctly like:

   - pio app list
   - pio app new

with all the configs and missing nodes you describe? What I'm trying to
find out is whether the problem is only with queries, which use ES in a
different way.

What is the es.nodes setting in the engine.json’s sparkConf?


From: jih...@braincolla.com  
Date: June 22, 2018 at 12:53:48 AM
To: actionml-user 

Subject:  a question about a high availability of Elasticsearch cluster

Hello Pat,

May I have a question about Elasticsearch cluster in PIO and UR.

I've implemented an Elasticsearch cluster consisting of 3 nodes with the
options below.

**
cluster.name: my-search-cluster
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["node 1", "node 2", "node 3"]

And I wrote the PIO options below.

**
...
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch

# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=my-search-cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node1,node2,node3
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200
...

My questions are below.

1. I killed the Elasticsearch process in node 2 or node 3. PIO is well
working. But when the Elasticsearch process in node 1 is killed, PIO is not
working. Is it right?

2. I've changed PIO options below. I killed the Elasticsearch process in
node 1 or node 3. PIO is well working. But when the Elasticsearch in node 2
is killed, PIO is not working. Is it right?
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=node2,node1,node3

3. In my opinion, if the first node configured in
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS is killed, PIO stops working. Is
that right? If yes, please let me know why it happens.

Thank you.


Re: UR trending ranking as separate process

2018-06-20 Thread Pat Ferrel
Yes, we support "popular", "trending", and "hot" as methods for ranking items.
UR queries are backfilled with these items if there are not enough results.
So if the user has little history and therefore only gets 5 out of 10 results
based on that history, we will automatically return the other 5 from the
"popular" results. This is the default if there is no specific config for this.

If you query with no user or item, we will return results only from "popular"
or whatever brand of ranking you have set up.

To change which type of ranking you want, you can specify the period to use in
calculating the ranking and which method from "popular", "trending", and "hot".
These roughly correspond to number of conversions, speed of conversion, and
acceleration in conversions, if that helps.

Docs here: http://actionml.com/docs/ur_config (search for "rankings").
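For illustration, a rankings section in engine.json might look roughly like the following. The field names here are assumptions based on the config docs at actionml.com and may differ by UR version, so treat this as a sketch and check the docs for the exact schema:

```json
"rankings": [
  {
    "name": "trendRank",
    "type": "trending",
    "eventNames": ["buy"],
    "duration": "2 days"
  }
]
```

Swapping "trending" for "popular" or "hot" selects the other two ranking methods described above.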


From: Sami Serbey 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 10:25:53 AM
To: user@predictionio.apache.org , Pat Ferrel 

Cc: user@predictionio.apache.org 
Subject:  Re: UR trending ranking as separate process  

Hi George,

I didn't get your question, but I think I am missing something. So you're
using the Universal Recommender and you're getting output sorted by trending
items? Is that really a thing in this template? May I please know how to
configure the template to get such output? I really hope you can answer that.
I am also working with the UR template.

Regards,
Sami Serbey

Get Outlook for iOS
From: George Yarish 
Sent: Wednesday, June 20, 2018 7:45:12 PM
To: Pat Ferrel
Cc: user@predictionio.apache.org
Subject: Re: UR trending ranking as separate process
 
Matthew, Pat

Thanks for the answers and concerns. Yes, we want to calculate trending every
30 minutes over the last X hours, where X might even be a few days. So the
realtime analogy is correct.

On Wed, Jun 20, 2018 at 6:50 PM, Pat Ferrel  wrote:
No the trending algorithm is meant to look at something like trends over 2 
days. This is because it looks at 2 buckets of conversion frequencies and if 
you cut them smaller than a day you will have so much bias due to daily 
variations that the trends will be invalid. In other words the ups and downs 
over a day period need to be made irrelevant and taking day long buckets is the 
simplest way to do this. Likewise for “hot” which needs 3 buckets and so takes 
3 days worth of data. 

Maybe what you need is to just count conversions for 30 minutes as a realtime 
thing. For every item, keep conversions for the last 30 minutes, sort them 
periodically by count. This is a Kappa style algorithm doing online learning, 
not really supported by PredictionIO. You will have to experiment with the 
length of time since a too small period will be very noisy, popping back and 
forth between items semi-randomly.
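A realtime counter like that could be prototyped outside PIO with a simple sliding window. This is an illustrative sketch under assumed semantics (count conversions per item over the last N seconds and rank by count), not anything PredictionIO or the UR ships:

```python
import time
from collections import defaultdict, deque

class TrendingCounter:
    """Count conversions per item over a sliding window and rank by count."""

    def __init__(self, window_seconds=1800):  # 30 minutes
        self.window = window_seconds
        self.events = defaultdict(deque)  # item_id -> deque of timestamps

    def record(self, item_id, ts=None):
        self.events[item_id].append(ts if ts is not None else time.time())

    def _prune(self, now):
        # Drop timestamps that have fallen out of the window
        cutoff = now - self.window
        for q in self.events.values():
            while q and q[0] < cutoff:
                q.popleft()

    def top(self, n=10, now=None):
        now = now if now is not None else time.time()
        self._prune(now)
        ranked = sorted(self.events.items(),
                        key=lambda kv: len(kv[1]), reverse=True)
        return [(item, len(q)) for item, q in ranked[:n] if q]

# Example: "a" converts 3 times recently, "b" once, "c" only long ago.
tc = TrendingCounter(window_seconds=1800)
tc.record("c", ts=100)  # will be outside the window by query time
for t in (5000, 5100, 5200):
    tc.record("a", ts=t)
tc.record("b", ts=5150)
print(tc.top(now=5300))  # -> [('a', 3), ('b', 1)]
```

As Pat notes, a window this short will be noisy in practice, so the window length would need experimentation.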


From: George Yarish 
Reply: user@predictionio.apache.org 
Date: June 20, 2018 at 8:34:10 AM
To: user@predictionio.apache.org 
Subject:  UR trending ranking as separate process 

Hi!

Not sure this is the correct place to ask, since my question corresponds to
the UR specifically, not to pio itself I guess.

Anyway, we are using the UR template for predictionio and we are about to use
trending ranking for sorting UR output. If I understand correctly, the
ranking is created during training and stored in ES. Our training takes ~3
hours and we launch it daily by scheduler, but for trending rankings we want
to get up-to-date information every 30 minutes.

That means we want to separate training (score calculation) and ranking
calculation and launch them on different schedules.

Is there any easy way to achieve this? Does UR support something like this?

Thanks,
George



-- 
George Yarish, Java Developer
Grid Dynamics
197101, Rentgena Str., 5A, St.Petersburg, Russia
Cell: +7 950 030-1941
Read Grid Dynamics' Tech Blog




Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Yes, those instructions tell you to run HDFS in pseudo-cluster mode. What
do you see in the HDFS GUI on localhost:50070 ?

Those setup instructions create a pseudo-clustered Spark, and HDFS/HBase.
This runs on a single machine but as the page says, are configured so you
can easily expand to a cluster by replacing config to point to remote HDFS
or Spark clusters.

One fix, if you don't want to run those services in pseudo-cluster mode, is:

1) Remove any mention of PGSQL or JDBC; we are not using it. These are not
found on the page you linked to and are not used.
2) On a single machine you can put the dummy/empty model file in LOCALFS, so
change the lines:

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models

to:

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS
PIO_STORAGE_SOURCES_HDFS_PATH=/path/to/models

substituting a directory where you want to save models.

Running them in a pseudo-cluster mode gives you GUIs to see job progress
and browse HDFS for files, among other things. We recommend it for helping
to debug problems when you get to large amounts of data and begin running
out of resources.


From: Anuj Kumar  
Date: June 19, 2018 at 10:35:02 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Hi Pat,
  Read it on the below link

http://actionml.com/docs/single_machine

here is the pio-env.sh

SPARK_HOME=$PIO_HOME/vendors/spark-2.1.1-bin-hadoop2.6

POSTGRES_JDBC_DRIVER=$PIO_HOME/lib/postgresql-42.0.0.jar

MYSQL_JDBC_DRIVER=$PIO_HOME/lib/mysql-connector-java-5.1.41.jar

HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

HBASE_CONF_DIR=/usr/local/hbase/conf

PIO_FS_BASEDIR=$HOME/.pio_store

PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines

PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta

PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event

PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

PIO_STORAGE_SOURCES_PGSQL_TYPE=jdbc

PIO_STORAGE_SOURCES_PGSQL_URL=jdbc:postgresql://localhost/pio

PIO_STORAGE_SOURCES_PGSQL_USERNAME=pio

PIO_STORAGE_SOURCES_PGSQL_PASSWORD=pio

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/els

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=pio

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs

PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://localhost:9000/models

PIO_STORAGE_SOURCES_HBASE_TYPE=hbase

PIO_STORAGE_SOURCES_HBASE_HOME=/usr/local/hbase

Thanks,
Anuj Kumar



On Tue, Jun 19, 2018 at 9:16 PM Pat Ferrel  wrote:

> Can you show me where on the AML site it says to store models in HDFS, it
> should not say that? I think that may be from the PIO site so you should
> ignore it.
>
> Can you share your pio-env? You need to go through the whole workflow from
> pio build, pio train, to pio deploy using a template from the same
> directory and with the same engine.json and pio-env and I suspect something
> is wrong in pio-env.
>
>
> From: Anuj Kumar 
> 
> Date: June 19, 2018 at 1:28:11 AM
> To: p...@occamsmachete.com  
> Cc: user@predictionio.apache.org 
> , actionml-u...@googlegroups.com
>  
> Subject:  Re: java.util.NoSuchElementException: head of empty list when
> running train
>
> Tried with the basic engine.json mentioned in the UR site examples. It
> seems to work but got stuck at "pio deploy", which throws the following
> error:
>
> [ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052
>
>
> Before that, "pio train" was successful but gave the following error. I
> suspect "pio deploy" is not working because of this. Please help.
>
> [ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
> be replicated to 0 nodes instead of minReplication (=1).  There are 0
> datanode(s) running and no node(s) are excluded in this operation.
>
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
>
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)
>
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
>
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
>
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
> at
> org.

Re: java.util.NoSuchElementException: head of empty list when running train

2018-06-19 Thread Pat Ferrel
Can you show me where on the AML site it says to store models in HDFS, it
should not say that? I think that may be from the PIO site so you should
ignore it.

Can you share your pio-env? You need to go through the whole workflow from
pio build, pio train, to pio deploy using a template from the same
directory and with the same engine.json and pio-env and I suspect something
is wrong in pio-env.


From: Anuj Kumar  
Date: June 19, 2018 at 1:28:11 AM
To: p...@occamsmachete.com  
Cc: user@predictionio.apache.org 
, actionml-u...@googlegroups.com
 
Subject:  Re: java.util.NoSuchElementException: head of empty list when
running train

Tried with the basic engine.json mentioned in the UR site examples. It seems
to work but got stuck at "pio deploy", which throws the following error:

[ERROR] [OneForOneStrategy] Failed to invert: [B@35c7052


Before that, "pio train" was successful but gave the following error. I
suspect "pio deploy" is not working because of this. Please help.

[ERROR] [HDFSModels] File /models/pio_modelAWQXIr4APcDlNQi8DwVj could only
be replicated to 0 nodes instead of minReplication (=1).  There are 0
datanode(s) running and no node(s) are excluded in this operation.

at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1726)

at
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)

at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2565)

at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)

at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)

at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)

at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)


On Tue, Jun 19, 2018 at 10:45 AM Anuj Kumar 
wrote:

> Sure, here it is.
>
> {
>
>   "comment":" This config file uses default settings for all but the
> required values see README.md for docs",
>
>   "id": "default",
>
>   "description": "Default settings",
>
>   "engineFactory": "com.actionml.RecommendationEngine",
>
>   "datasource": {
>
> "params" : {
>
>   "name": "sample-handmad",
>
>   "appName": "np",
>
>   "eventNames": ["read", "search", "view", "category-pref"],
>
>   "minEventsPerUser": 1,
>
>   "eventWindow": {
>
> "duration": "300 days",
>
> "removeDuplicates": true,
>
> "compressProperties": true
>
>   }
>
> }
>
>   },
>
>   "sparkConf": {
>
> "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
>
> "spark.kryo.registrator":
> "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
>
> "spark.kryo.referenceTracking": "false",
>
> "spark.kryoserializer.buffer": "300m",
>
> "spark.executor.memory": "4g",
>
> "spark.executor.cores": "2",
>
> "spark.task.cpus": "2",
>
> "spark.default.parallelism": "16",
>
> "es.index.auto.create": "true"
>
>   },
>
>   "algorithms": [
>
> {
>
>   "comment": "simplest setup where all values are default, popularity
> based backfill, must add eventsNames",
>
>   "name": "ur",
>
>   "params": {
>
> "appName": "np",
>
>     "indexName": "np",
>
> "typeName": "items",
>
> "blacklistEvents": [],
>
> "comment": "must have data for the first event or the model will
> not build, other events are optional",
>
> "indicators": [
>
>

Re: Few Queries Regarding the Recommendation Template

2018-06-13 Thread Pat Ferrel
Wow that page should be reworded or removed. They are trying to talk about
ensemble models, which are a valid thing but they badly misapply it there.
The application to multiple data types is just wrong and I know because I
tried exactly what they are suggesting but with cross-validation tests to
measure how much worse things got.

For instance if you use buy and dislike what kind of result are you going
to get if you have 2 models? One set of results will recommend “buy” the
other will tell you what a user is likely to “dislike”. How do you combine
them?

Ensembles are meant to use multiple *algorithms* and do something like
voting on recommendations. But you have to pay close attention to what the
algorithm uses as input and what it recommends. All members of the ensemble
must recommend the same action to the user.

Whoever contributed this statement: The default algorithm described in DASE
<https://predictionio.apache.org/templates/similarproduct/dase/#algorithm> uses
user-to-item view events as training data. However, your application may
have more than one type of events which you want to take into account, such
as buy, rate and like events. One way to incorporate other types of events
to improve the system is to add another algorithm to process these events,
build a separated model and then combine the outputs of multiple algorithms
during Serving.

Is patently wrong. Ensembles must recommend the same action to users, and
unless each algorithm in the ensemble is recommending the same thing (albeit
with slightly different internal logic) you will get gibberish out. The
winner of the Netflix Prize used an ensemble of 107 (IIRC) different
algorithms, all using exactly the same input data. There is no principle that
says that if you feed conflicting data into several ensemble algorithms you
will get diamonds out.

Furthermore using view events is bad to begin with because the recommender
will recommend what it thinks you want to view. We did this once with a
large dataset from a big E-Com company where we did cross-validation tests
using “buy” alone, “view” alone,  and ensembles of “buy” and “view”. We got
far better results using buy alone than using buy with ~100x as many
“views". The intent of the user and how they find things to view is so
different than when they finally come to buy something that adding view
data got significantly worse results. This is because people have different
reasons to view—maybe a flashy image, maybe a promotion, maybe some
placement bias, etc. This type of browsing “noise” pollutes the data which
can no longer be used to recommend “buy”s. We did several experiments
including comparing several algorithms types with “buy” and “view” events.
“view” always lost to “buy” no matter the algo we used (they were all
unimodal). There may be some exception to this result out there but it will
be accidental, not because it is built into the algorithm. When I say this
worsened results I'm not talking about some tiny fraction of a %; I'm
talking about a decrease of 15-20%.

You could argue that “buy”, “like”, and rate will produce similar results
but from experience I can truly say that view and dislike will not.

Since the method described on the site is so sensitive to the user intent
recorded in events I would never use something like that without doing
cross-validation tests and then you are talking about a lot of work. There
is no theoretical or algorithmic correlation detection built into the
ensemble method so you may or may not get good results and I can say
unequivocally that the exact thing they describe will give worse results
(or at least it did in our experiments). You cannot ignore the intent
behind the data you use as input unless this type of correlation detection
is built into the algorithm and with the ensemble method described this
issue is completely ignored.

The UR uses the Correlated Cross-Occurrence algorithm for this exact reason
and was invented to solve the problem we found using “buy” and “view” data
together.  Let’s take a ridiculous extreme and use “dislikes" to recommend
“likes”? Does that even make sense? Check out an experiment with CCO where
we did this exact thing:
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

OK, rant over :-) Thanks for bringing up one of the key issues being
addressed by modern recommenders—multimodality. It is being addressed in
scientific ways, unfortunately the page on PIO’s site gets it wrong.




From: KRISH MEHTA  
Reply: KRISH MEHTA  
Date: June 13, 2018 at 2:19:17 PM
To: Pat Ferrel  
Subject:  Re: Few Queries Regarding the Recommendation Template

I Understand but if I just want the likes, dislikes and views then I can
combine the algorithms right? Given in the link:
https://predictionio.apache.org/templates/similarproduct/multi-events-multi-algos/
I
hope this works.

On Jun 13, 2018, at 1:19 PM, Pat Ferrel  wrote:

I would strongly recommend against using rati

Re: True Negative - ROC Curve

2018-06-12 Thread Pat Ferrel
We do not use these for recommenders. The precision rate is low when the
lift in your KPI like sales is relatively high. This is not like
classification.

We use MAP@k with increasing values of k. This should yield a diminishing
mean average precision chart with increasing k. This tells you 2 things: 1)
you are guessing in the right order; MAP@1 greater than MAP@2 means your
first guess is better than your second, and the rate of decrease tells you
how fast precision drops off with higher k. And 2) it gives you a baseline
MAP@k for future comparisons when tuning your engine, or for
champion/challenger comparisons before putting changes into A/B tests.
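The MAP@k computation described here can be sketched as follows. This is a minimal illustration of the metric using one common AP@k variant (normalizing by min(k, number of relevant items)), not PIO's evaluation code:

```python
def average_precision_at_k(recommended, converted, k):
    """Average precision over ranks 1..k; only ranks that hit contribute."""
    score, hits = 0.0, 0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in converted:
            hits += 1
            score += hits / rank
    return score / min(k, len(converted)) if converted else 0.0

def map_at_k(all_recommended, all_converted, k):
    """Mean of per-user average precision at k over all test users."""
    pairs = list(zip(all_recommended, all_converted))
    return sum(average_precision_at_k(r, c, k) for r, c in pairs) / len(pairs)

# Two test users: the recommender's ranked guesses vs. held-out conversions.
recs = [["a", "b", "c"], ["x", "y", "z"]]
truth = [{"a", "c"}, {"z"}]
print(map_at_k(recs, truth, 1))  # 0.5: first guess hits for user 1, misses for user 2
print(map_at_k(recs, truth, 3))
```

Computing this over a range of k on the same test split gives the diminishing-precision chart described above.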

Also note that RMSE has been pretty much discarded as an offline metric for
recommenders; it only really gives you a metric for rating prediction, and
who cares about that. No one wants to optimize rating guesses anymore;
conversions are all that matter, and precision is the way to measure
potential conversion since it actually measures how precisely we guessed what
the user converted on in the test set. Ranking is next most important: since
you have a limited number of recommendations to show, you want the best
ranked first. MAP@k over a range of k measures this, but clients often try to
read sales lift into it, and there is no absolute relationship. You can guess
at one once you have A/B test results, and you should also compare against
non-recommendation results like random recs or popular recs. If MAP is lower
than or close to these, you may not have a good recommender or good data.

AUC is not for every task. In this case the only positive is a conversion in
the test data and the only negative is the absence of a conversion, so the
ROC curve will be nearly useless.


From: Nasos Papageorgiou 

Reply: user@predictionio.apache.org 

Date: June 12, 2018 at 7:17:04 AM
To: user@predictionio.apache.org 

Subject:  True Negative - ROC Curve

Hi all,

I want to use ROC curve (AUC - Area Under the Curve) for evaluation of
recommended system in case of retailer. Could you please give an example of
True Negative value?

i.e. True Positive is the number of items on the Recommended List that
appear in the test data set, where the test data set may be 20% of
the full data.

Thank you.






Re: Regarding Real-Time Prediction

2018-06-11 Thread Pat Ferrel
Actually if you are using the Universal Recommender you only need to deploy 
once as long as the engine.json does not change. The hot swap happens as 
@Digambar says and there is literally no downtime. If you are using any of the 
other recommenders you do have to re-deploy after every train but the deploy 
happens very quickly, a ms or 2 as I recall.


From: Digambar Bhat 
Reply: user@predictionio.apache.org 
Date: June 11, 2018 at 9:38:15 AM
To: user@predictionio.apache.org 
Subject:  Re: Regarding Real-Time Prediction  

You don't need to deploy same engine again and again. You just deploy once and 
train whenever you want. Deployed instance will automatically point to newly 
trained model as hot swap happens. 

Regards,
Digambar

On Mon 11 Jun, 2018, 10:02 PM KRISH MEHTA,  wrote:
Hi,
I have just started using PredictionIO and according to the documentation I 
have to always run the Train and Deploy Command to get the prediction. I am 
working on predicting videos for recommendation and I want to know if there is 
any other way possible so that I can predict the results on the Fly with no 
Downtime.

Please help me with the same.

Yours Sincerely,
Krish

Re: UR template minimum event number to recommend

2018-06-04 Thread Pat Ferrel
No but we have 2 ways to handle this situation automatically and you can
tell if recommendations are not from personal user history.


   1. when there is not enough user history to recommend, we fill in the
   lower ranking recommendations with popular, trending, or hot items. Not
   completely irrelevant but certainly not as good as if we had more data for
   them.
   2. You can also mix item- and user-based recs. So if you have an item,
   perhaps from the page or screen the user is looking at, you can send both
   the user and the item in the query. If you want user-based, boost it higher
   with the userBias. Then if the query cannot send back user-based results it
   will fill in with item-based ones. This only works in certain situations
   where you have some example item.

As always if you do a user-based query and all scores are 0, you know that
no real recommendations are included and can take some other action.
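The mixed user+item query above can be sketched as a small client. This is a hedged sketch: it assumes a Universal Recommender deployed on a PredictionServer exposing the usual `/queries.json` endpoint, and the host, port, and ids are illustrative.

```python
# Sketch: a combined user+item UR query with userBias, plus the
# all-scores-zero check described above. Host/port and ids are assumptions.
import json
from urllib import request

def build_query(user=None, item=None, user_bias=None, num=10):
    query = {"num": num}
    if user is not None:
        query["user"] = user
    if item is not None:
        query["item"] = item
    if user_bias is not None:
        query["userBias"] = user_bias  # > 1 boosts user-based results
    return query

query = build_query(user="u-123", item="sku-456", user_bias=2.0)
print(json.dumps(query))

def send_query(query, url="http://localhost:8000/queries.json"):
    """POST the query; flag whether any result looks personalized."""
    req = request.Request(url, data=json.dumps(query).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        result = json.load(resp)
    # If every score is 0, these are fallback recs (popular/item-based),
    # not from personal user history, and you can take some other action.
    personalized = any(r.get("score", 0) > 0
                       for r in result.get("itemScores", []))
    return result, personalized
```

Sending requires a running PredictionServer, so `send_query` is only defined here, not called.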


From: Krajcs Ádám  
Reply: user@predictionio.apache.org 

Date: June 4, 2018 at 5:14:33 AM
To: user@predictionio.apache.org 

Subject:  UR template minimum event number to recommend

Hi,



Is it possible to configure the Universal Recommender to recommend
items only to users with a minimum number of events? For example, a user with
2 view events usually gets irrelevant recommendations, but 5 events would be
enough.



Thanks!



Regards,

Adam Krajcs


Re: PIO 0.12.1 with HDP Spark on YARN

2018-05-29 Thread Pat Ferrel
Yarn has to be started explicitly. Usually it is part of Hadoop and is
started with Hadoop. Spark only contains the client for Yarn (afaik).



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:45:43 PM
To: user@predictionio.apache.org 

Subject:  Re: PIO 0.12.1 with HDP Spark on YARN

That's the command that I'm using but it gives me the exception that I
listed in the previous email.  I've installed a Spark standalone cluster
and am using that for training for now but would like to use Spark on YARN
eventually.

Are you using HDP? If so, what version of HDP are you using?  I'm using
*HDP-2.6.2.14.*



On Tue, May 29, 2018 at 8:55 PM, suyash kharade 
wrote:

> I use 'pio train -- --master yarn'
> It works for me to train universal recommender
>
> On Tue, May 29, 2018 at 8:31 PM, Miller, Clifford <
> clifford.mil...@phoenix-opsgroup.com> wrote:
>
>> To add more details to this.  When I attempt to execute my training job
>> using the command 'pio train -- --master yarn' I get the exception that
>> I've included below.  Can anyone tell me how to correctly submit the
>> training job or what setting I need to change to make this work.  I've made
>> not custom code changes and am simply using PIO 0.12.1 with the
>> SimilarProduct Recommender.
>>
>>
>>
>> [ERROR] [SparkContext] Error initializing SparkContext.
>> [INFO] [ServerConnector] Stopped Spark@1f992a3a{HTTP/1.1}{0.0.0.0:4040}
>> [WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request
>> executors before the AM has registered!
>> [WARN] [MetricsSystem] Stopping a MetricsSystem that is not running
>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:154)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$se
>> tEnvFromInputString$1.apply(YarnSparkHadoopUtil.scala:152)
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.
>> scala:186)
>> at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.setEnvFrom
>> InputString(YarnSparkHadoopUtil.scala:152)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:819)
>> at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$
>> 6.apply(Client.scala:817)
>> at scala.Option.foreach(Option.scala:257)
>> at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.sc
>> ala:817)
>> at org.apache.spark.deploy.yarn.Client.createContainerLaunchCon
>> text(Client.scala:911)
>> at org.apache.spark.deploy.yarn.Client.submitApplication(Client
>> .scala:172)
>> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBacken
>> d.start(YarnClientSchedulerBackend.scala:56)
>> at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSched
>> ulerImpl.scala:156)
>> at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>> at org.apache.predictionio.workflow.WorkflowContext$.apply(
>> WorkflowContext.scala:45)
>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(
>> CoreWorkflow.scala:59)
>> at org.apache.predictionio.workflow.CreateWorkflow$.main(Create
>> Workflow.scala:251)
>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateW
>> orkflow.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>> $SparkSubmit$$runMain(SparkSubmit.scala:751)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>> .scala:187)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.
>> scala:212)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:
>> 126)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>>
>>
>> On Tue, May 29, 2018 at 12:01 AM, Miller, Clifford <
>> clifford.mil...@phoenix-opsgroup.com> wrote:
>>
>>> So updating the version in the RELEASE file to 2.1.1 fixed the version
>>> detection problem but I'm still not able to submit Spark jobs unless they
>>> are strictly local.  How are you submitting to the HDP Spark?
>>>
>>> Thanks,
>>>
>>> --Cliff.
>>>
>>>
>>>
>>> On Mon, May 28, 2018 at 1:12 AM, suyash kharade <
>>> suyash.khar...@gmail.com> wrote:
>>>
 Hi Miller,
 I faced same issue.
 It gives an error because the RELEASE file has a '-' in the version string.
 Put a simple version in the RELEASE file, something like 2.6.

 On Mon, May 28, 2018 at 4:32 AM, Miller, 

Re: Spark cluster error

2018-05-29 Thread Pat Ferrel
Yes, the spark-submit --jars is where we started to find the missing class.
The class isn’t found on the remote executor, so we looked in the jars
actually downloaded into the executor’s work dir. The PIO assembly jars are
there and do have the classes. This would be in the classpath of the
executor, right? Not sure what you are asking.

Are you asking about the SPARK_CLASSPATH in spark-env.sh? The default
should include the work subdir for the job, I believe, and it can only be
added to, so we couldn’t have messed that up if it points first to the
work/job-number dir, right?

I guess the root of my question is how can the jars be downloaded to the
executor’s work dir and still the classes we know are in the jar are not
found?
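One quick sanity check for the verification steps above: a jar is just a zip archive, so the stdlib can confirm a class really is in the assembly handed to --jars. This is a hedged sketch; the jar path in the commented example is an assumption, adjust it to your build output.

```python
# Sketch: verify a fully-qualified class has a .class entry in a jar,
# before blaming the executor classpath.
import zipfile

def jar_contains_class(jar_path, fqcn):
    """Return True if fqcn (e.g. 'a.b.C') has an a/b/C.class entry."""
    entry = fqcn.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Illustrative usage (path is an assumption):
# jar_contains_class(
#     "assembly/pio-assembly-0.12.1.jar",
#     "org.apache.hadoop.hbase.protobuf.ProtobufUtil")
```

The same check can be run against the copies in each executor's work dir to confirm what was actually shipped.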


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 1:27:03 PM
To: user@predictionio.apache.org 

Subject:  Re: Spark cluster error

Sorry, what I meant was the actual spark-submit command that PIO was using.
It should be in the log.

What Spark version was that? I recall classpath issues with certain
versions of Spark.

On Thu, May 24, 2018 at 4:52 PM, Pat Ferrel  wrote:

> Thanks Donald,
>
> We have:
>
>- built pio with hbase 1.4.3, which is what we have deployed
>- verified that the `ProtobufUtil` class is in the pio hbase assembly
>- verified the assembly is passed in --jars to spark-submit
>- verified that the executors receive and store the assemblies in the
>FS work dir on the worker machines
>- verified that hashes match the original assembly so the class is
>being received by every executor
>
> However the executor is unable to find the class.
>
> This seems just short of impossible but clearly possible. How can the
> executor deserialize the code but not find it later?
>
> Not sure what you mean by the classpath going into the cluster? The classDef
> not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
> where it should get it?
>
> Thanks again
> p
>
>
> From: Donald Szeto  
> Reply: user@predictionio.apache.org 
> 
> Date: May 24, 2018 at 2:10:24 PM
> To: user@predictionio.apache.org 
> 
> Subject:  Re: Spark cluster error
>
> 0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
> Looking at Git history it has not changed in a while.
>
> Do you have the exact classpath that has gone into your Spark cluster?
>
> On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel  wrote:
>
>> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
>> Spark cluster? The issue seems to be how to pass the correct code to Spark
>> to connect to HBase:
>>
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> [ERROR] [TransportRequestHandler] Error while invoking
>> RpcHandler#receive() for one-way message.
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil
>> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convert
>> StringToScan(TableMapReduceUtil.java:521)
>> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(
>> TableInputFormat.java:110)
>> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:170)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
>> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)```
>> (edited)
>>
>> Now that we have these pluggable DBs did I miss something? This works
>> with master=local but not with remote Spark master
>>
>> I’ve passed in the hbase-client in the --jars part of spark-submit, still
>> fails, what am I missing?
>>
>>
>> From: Pat Ferrel  
>> Reply: Pat Ferrel  
>> Date: May 23, 2018 at 8:57:32 AM
>> To: user@predictionio.apache.org 
>> 
>> Subject:  Spark cluster error
>>
>> Same CLI works using local Spark master, but fails using remote master
>> for a cluster due to a missing class def for protobuf used in hbase. We are
>> using the binary dist 0.12.1.  Is this known? Is there a work around?
>>
>> We are now trying a source build in hope the class will be put in the
>> assembly passed to Spark and the reasoning is that the executors don’t
>> contain hbase classes but when you run a local executor it does, due to
>> some local classpath. If the source built assembly does not have these
>> classes, we will have the same problem. Namely how to get protobuf to the
>> executors.
>>
>> Has anyone seen this?
>>
>>
>


Re: pio app new failed in hbase

2018-05-29 Thread Pat Ferrel
No, this is as expected. When you run pseudo-distributed everything
internally is configured as if the services were on separate machines. See
clustered instructions here: http://actionml.com/docs/small_ha_cluster This
is to setup 3 machines running different parts and is not really the best
physical architecture but does illustrate how a distributed setup would go.

BTW we (ActionML) use containers now to do this setup but it still works.
The smallest distributed cluster that makes sense for the Universal
Recommender is 5 machines. 2 dedicated to Spark, which can be started and
stopped around the `pio train` process. So 3 are permanent; one for PIO
servers (EventServer and PredictionServer) one for HDFS+HBase, one for
Elasticsearch. This allows you to vertically scale by increasing the size
of the service instances in-place (easy with AWS), then horizontally scale
HBase or Elasticsearch, or Spark independently if vertical scaling is not
sufficient. You can also combine the 2 Spark instances as long as you
remember that the `pio train` process creates a Spark Driver on the machine
the process is launched on, so the driver may need to be nearly as
powerful as a Spark Executor. The Spark Driver is an “invisible” and
therefore often overlooked member of the Spark cluster. It is often, but not
always, smaller than the executors; putting it on the PIO servers machine is
therefore risky in terms of scaling unless you know the resources it
will need. Using Yarn can put the Driver on the cluster (off the launching
machine) but is more complex than the default Spark “standalone” config.

The Universal Recommender is the exception here because it does not require
a big non-local Spark for anything but training, so we move the `pio train`
process to a Spark “Driver” machine that is ephemeral as the Spark
Executor(s) is(are). Other templates may require Spark in train and deploy.
Once the UR’s training is done it will automatically swap in the new model
so the running deployed PredictionServer will automatically start using
it—no re-deploy needed.


From: Marco Goldin  
Reply: user@predictionio.apache.org 

Date: May 29, 2018 at 6:38:21 AM
To: user@predictionio.apache.org 

Subject:  Re: pio app new failed in hbase

I was able to solve the issue by deleting the hbase folder in HDFS with "hdfs
dfs -rm -r /hbase" and restarting HBase.
Now app creation in pio is working again.

I still wonder why this problem happens though. I'm running HBase in
pseudo-distributed mode (for testing purposes everything, from Spark to
Hadoop, is on a single machine); could that be a problem for PredictionIO in
managing the apps?

2018-05-29 13:47 GMT+02:00 Marco Goldin :

> Hi all, I deleted all old apps from PredictionIO (currently running 0.12.0)
> but when I create a new one I get this error from HBase.
> I inspected HBase from the shell but there aren't any tables inside.
>
>
> ```
>
> pio app new mlolur
>
> [INFO] [HBLEvents] The table pio_event:events_1 doesn't exist yet.
> Creating now...
>
> Exception in thread "main" org.apache.hadoop.hbase.TableExistsException:
> org.apache.hadoop.hbase.TableExistsException: pio_event:events_1
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> prepareCreate(CreateTableProcedure.java:299)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:106)
>
> at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.
> executeFromState(CreateTableProcedure.java:58)
>
> at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(
> StateMachineProcedure.java:119)
>
> at org.apache.hadoop.hbase.procedure2.Procedure.
> doExecute(Procedure.java:498)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(
> ProcedureExecutor.java:1147)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:942)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> execLoop(ProcedureExecutor.java:895)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.
> access$400(ProcedureExecutor.java:77)
>
> at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$
> 2.run(ProcedureExecutor.java:497)
>
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:62)
>
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>
> at org.apache.hadoop.ipc.RemoteException.instantiateException(
> RemoteException.java:106)
>
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(
> RemoteException.java:95)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:209)
>
> at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(
> RpcRetryingCaller.java:223)
>
> at 

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
How are you starting the EventServer? You should not use pio-start-all,
which assumes all services are local:

configure pio-env.sh with your remote HBase
start the EventServer with `pio eventserver &`, or some method that won’t kill
it when you log off, like `nohup pio eventserver &`
this will not start a local HBase, so your remote one should already be
running
the same goes for the remote Elasticsearch and HDFS; they should be in
pio-env.sh and already started
`pio status` should then be fine with the remote HBase


From: Miller, Clifford <clifford.mil...@phoenix-opsgroup.com>
<clifford.mil...@phoenix-opsgroup.com>
Reply: Miller, Clifford <clifford.mil...@phoenix-opsgroup.com>
<clifford.mil...@phoenix-opsgroup.com>
Date: May 25, 2018 at 10:16:01 AM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: PIO not using HBase cluster

I'll keep you informed.  However, I'm having issues getting past this.  If
I have HBase installed with the cluster's config files then it still does
not communicate with the cluster.  It does start HBase, but on the local PIO
server.  If I ONLY have the hbase config (which worked in version 0.10.0)
then pio-start-all gives the following message.


 pio-start-all
Starting Elasticsearch...
Starting HBase...
/home/centos/PredictionIO-0.12.1/bin/pio-start-all: line 65:
/home/centos/PredictionIO-0.12.1/vendors/hbase/bin/start-hbase.sh: No such
file or directory
Waiting 10 seconds for Storage Repositories to fully initialize...
Starting PredictionIO Event Server...


"pio status" then returns:


 pio status
[INFO] [Management$] Inspecting PredictionIO...
[INFO] [Management$] PredictionIO 0.12.1 is installed at
/home/centos/PredictionIO-0.12.1
[INFO] [Management$] Inspecting Apache Spark...
[INFO] [Management$] Apache Spark is installed at
/home/centos/PredictionIO-0.12.1/vendors/spark
[INFO] [Management$] Apache Spark 2.1.1 detected (meets minimum requirement
of 1.3.0)
[INFO] [Management$] Inspecting storage backend connections...
[INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
[INFO] [Storage$] Verifying Model Data Backend (Source: HDFS)...
[WARN] [DomainSocketFactory] The short-circuit local reads feature cannot
be used because libhadoop cannot be loaded.
[INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
[ERROR] [RecoverableZooKeeper] ZooKeeper exists failed after 1 attempts
[ERROR] [ZooKeeperWatcher] hconnection-0x558756be, quorum=localhost:2181,
baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
[WARN] [ZooKeeperRegistry] Can't retrieve clusterId from Zookeeper
[ERROR] [StorageClient] Cannot connect to ZooKeeper (ZooKeeper ensemble:
localhost). Please make sure that the configuration is pointing at the
correct ZooKeeper ensemble. By default, HBase manages its own ZooKeeper, so
if you have not configured HBase to use an external ZooKeeper, that means
your HBase is not started or configured properly.
[ERROR] [Storage$] Error initializing storage client for source HBASE.
org.apache.hadoop.hbase.ZooKeeperConnectionException: Can't connect to
ZooKeeper
at
org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2358)
at
org.apache.predictionio.data.storage.hbase.StorageClient.<init>(StorageClient.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:252)
at
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:283)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:244)
at
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
at
scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
at
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)
at
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)
at
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:300)
at
org.apache.predictionio.data.storage.Storage$.getLEvents(Storage.scala:448)
at
org.apache.predictionio.data.storage.Storage$.verifyAllDataObjects(Storage.scala:384)
at
org.apache.predictionio.tools.commands.Management$.status(Management.scala

Re: PIO not using HBase cluster

2018-05-25 Thread Pat Ferrel
No, you need to have HBase installed, or at least its config installed on
the PIO machine. The servers defined in pio-env.sh will be configured for
cluster operation and started separately from PIO; PIO will then only try to
communicate with HBase, not start it. But PIO still needs the config
for the client code that is in the pio assembly jar.

Some services were not cleanly separated between client, master, and slave,
so a complete installation is easiest, though you can figure out the minimum
with experimentation; I think it is just the conf directory.

BTW we have a similar setup and are having trouble with the Spark training
phase getting a `classDefNotFound: org.apache.hadoop.hbase.ProtobufUtil` so
can you let us know how it goes?



From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 25, 2018 at 9:43:46 AM
To: user@predictionio.apache.org 

Subject:  PIO not using HBase cluster

I'm attempting to use a remote cluster with PIO 0.12.1.  When I run
pio-start-all it starts the hbase locally and does not use the remote
cluster as configured.  I've copied the HBase and Hadoop conf files from
the cluster and put them into the locally configured directories.  I set
this up in the past using a similar configuration but was using PIO
0.10.0.  When doing this with this version I could start pio with only the
hbase and hadoop conf present.  This does not seem to be the case any
longer.

If I only put the cluster configs then it complains that it cannot find
start-hbase.sh.  If I put a hbase installation with cluster configs then it
will start a local hbase and not use the remote cluster.

Below is my PIO configuration



#!/usr/bin/env bash
#
# Safe config that will work if you expand your cluster later
SPARK_HOME=$PIO_HOME/vendors/spark
ES_CONF_DIR=$PIO_HOME/vendors/elasticsearch
HADOOP_CONF_DIR=$PIO_HOME/vendors/hadoop/conf
HBASE_CONF_DIR=$PIO_HOME/vendors/hbase/conf


# Filesystem paths where PredictionIO uses as block storage.
PIO_FS_BASEDIR=$HOME/.pio_store
PIO_FS_ENGINESDIR=$PIO_FS_BASEDIR/engines
PIO_FS_TMPDIR=$PIO_FS_BASEDIR/tmp

# PredictionIO Storage Configuration
PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

# Need to use HDFS here instead of LOCALFS to enable deploying to
# machines without the local model
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=HDFS

# What store to use for what data
# Elasticsearch Example
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=$PIO_HOME/vendors/elasticsearch
# The next line should match the ES cluster.name in ES config
PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=dsp_es_cluster

# For clustered Elasticsearch (use one host/port if not clustered)
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=ip-10-0-1-136.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal,ip-10-0-1-126.us-gov-west-1.compute.internal
#PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300,9300,9300
#PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
# PIO 0.12.0+ uses the REST client for ES 5+ and this defaults to
# port 9200, change if appropriate but do not use the Transport Client port
# PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_HDFS_TYPE=hdfs
PIO_STORAGE_SOURCES_HDFS_PATH=hdfs://ip-10-0-1-138.us-gov-west-1.compute.internal:8020/models

# HBase Source config
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_HBASE_HOME=$PIO_HOME/vendors/hbase

# Hbase clustered config (use one host/port if not clustered)
PIO_STORAGE_SOURCES_HBASE_HOSTS=ip-10-0-1-138.us-gov-west-1.compute.internal,ip-10-0-1-209.us-gov-west-1.compute.internal,ip-10-0-1-79.us-gov-west-1.compute.internal
~


Re: Spark2 with YARN

2018-05-24 Thread Pat Ferrel
I’m having a java.lang.NoClassDefFoundError in a different context and
different class. Have you tried this without Yarn? Sorry I can’t find the
rest of this thread.


From: Miller, Clifford 

Reply: user@predictionio.apache.org 

Date: May 24, 2018 at 4:16:58 PM
To: user@predictionio.apache.org 

Subject:  Spark2 with YARN

I've setup a cluster using Hortonworks HDP with Ambari all running in AWS.
I then created a separate EC2 instance and installed PIO 0.12.1, hadoop,
elasticsearch, hbase, and spark2.  I copied the configurations from the HDP
cluster and then pio-start-all.  The pio-start-all completes successfully
and running "pio status" also shows success.  I'm following the "Text
Classification Engine Tutorial".  I've imported the data.  I'm using the
following command to train: "pio train -- --master yarn".  After running
the command I get the following exception.  Does anyone have any ideas of
what I may have missed during my setup?

Thanks in advance.

#
Exception follows:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/sun/jersey/api/client/config/ClientConfig
at
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
at
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
at
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:152)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at
org.apache.predictionio.workflow.WorkflowContext$.apply(WorkflowContext.scala:45)
at
org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:59)
at
org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)
at
org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
com.sun.jersey.api.client.config.ClientConfig
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 20 more

##


Re: Spark cluster error

2018-05-24 Thread Pat Ferrel
Thanks Donald,

We have:

   - built pio with hbase 1.4.3, which is what we have deployed
   - verified that the `ProtobufUtil` class is in the pio hbase assembly
   - verified the assembly is passed in --jars to spark-submit
   - verified that the executors receive and store the assemblies in the FS
   work dir on the worker machines
   - verified that hashes match the original assembly so the class is being
   received by every executor

However the executor is unable to find the class.

This seems just short of impossible but clearly possible. How can the
executor deserialize the code but not find it later?

Not sure what you mean by the classpath going into the cluster? The classDef
not found does seem to be in the pio 0.12.1 hbase assembly, isn’t this
where it should get it?

Thanks again
p


From: Donald Szeto <don...@apache.org> <don...@apache.org>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 24, 2018 at 2:10:24 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: Spark cluster error

0.12.1 packages HBase 0.98.5-hadoop2 in the storage driver assembly.
Looking at Git history it has not changed in a while.

Do you have the exact classpath that has gone into your Spark cluster?

On Wed, May 23, 2018 at 1:30 PM, Pat Ferrel <p...@actionml.com> wrote:

> A source build did not fix the problem, has anyone run PIO 0.12.1 on a
> Spark cluster? The issue seems to be how to pass the correct code to Spark
> to connect to HBase:
>
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> [ERROR] [TransportRequestHandler] Error while invoking
> RpcHandler#receive() for one-way message.
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 4 in stage 0.0 failed 4 times, most recent
> failure: Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.hadoop.hbase.protobuf.ProtobufUtil
> at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.
> convertStringToScan(TableMapReduceUtil.java:521)
> at org.apache.hadoop.hbase.mapreduce.TableInputFormat.
> setConf(TableInputFormat.java:110)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:170)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(
> MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>
> Now that we have these pluggable DBs did I miss something? This works with
> master=local but not with remote Spark master
>
> I’ve passed in the hbase-client in the --jars part of spark-submit, still
> fails, what am I missing?
>
>
> From: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
> Reply: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
> Date: May 23, 2018 at 8:57:32 AM
> To: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Subject:  Spark cluster error
>
> Same CLI works using local Spark master, but fails using remote master for
> a cluster due to a missing class def for protobuf used in hbase. We are
> using the binary dist 0.12.1. Is this known? Is there a workaround?
>
> We are now trying a source build in the hope that the class will be put in
> the assembly passed to Spark. The reasoning is that the remote executors
> don’t have the HBase classes, but a local executor does, due to some local
> classpath. If the source-built assembly does not have these classes, we will
> have the same problem: namely, how to get protobuf to the executors.
>
> Has anyone seen this?
>
>


Re: Spark cluster error

2018-05-23 Thread Pat Ferrel
A source build did not fix the problem, has anyone run PIO 0.12.1 on a
Spark cluster? The issue seems to be how to pass the correct code to Spark
to connect to HBase:

[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
[ERROR] [TransportRequestHandler] Error while invoking RpcHandler#receive()
for one-way message.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 4 in stage 0.0 failed 4 times, most recent failure:
Lost task 4.3 in stage 0.0 (TID 18, 10.68.9.147, executor 0):
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hbase.protobuf.ProtobufUtil
at
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertStringToScan(TableMapReduceUtil.java:521)
at
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:110)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:170)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

Now that we have these pluggable DBs did I miss something? This works with
master=local but not with remote Spark master

I’ve passed in the hbase-client in the --jars part of spark-submit, still
fails, what am I missing?


From: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
Reply: Pat Ferrel <p...@actionml.com> <p...@actionml.com>
Date: May 23, 2018 at 8:57:32 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Spark cluster error

Same CLI works using local Spark master, but fails using remote master for
a cluster due to a missing class def for protobuf used in hbase. We are
using the binary dist 0.12.1. Is this known? Is there a workaround?

We are now trying a source build in the hope that the class will be put in
the assembly passed to Spark. The reasoning is that the remote executors
don’t have the HBase classes, but a local executor does, due to some local
classpath. If the source-built assembly does not have these classes, we will
have the same problem: namely, how to get protobuf to the executors.

Has anyone seen this?


Spark cluster error

2018-05-23 Thread Pat Ferrel
Same CLI works using local Spark master, but fails using remote master for
a cluster due to a missing class def for protobuf used in hbase. We are
using the binary dist 0.12.1. Is this known? Is there a workaround?

We are now trying a source build in the hope that the class will be put in
the assembly passed to Spark. The reasoning is that the remote executors
don’t have the HBase classes, but a local executor does, due to some local
classpath. If the source-built assembly does not have these classes, we will
have the same problem: namely, how to get protobuf to the executors.

Has anyone seen this?


RE: Problem with training in yarn cluster

2018-05-23 Thread Pat Ferrel

at org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:244)
at org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:315)
at org.apache.predictionio.data.storage.Storage$.getPDataObject(Storage.scala:364)
at org.apache.predictionio.data.storage.Storage$.getPDataObject(Storage.scala:307)
at org.apache.predictionio.data.storage.Storage$.getPEvents(Storage.scala:454)
at org.apache.predictionio.data.store.PEventStore$.eventsDb$lzycompute(PEventStore.scala:37)
at org.apache.predictionio.data.store.PEventStore$.eventsDb(PEventStore.scala:37)
at org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:73)
at com.actionml.DataSource.readTraining(DataSource.scala:76)
at com.actionml.DataSource.readTraining(DataSource.scala:48)
at org.apache.predictionio.controller.PDataSource.readTrainingBase(PDataSource.scala:40)
at org.apache.predictionio.controller.Engine$.train(Engine.scala:642)
at org.apache.predictionio.controller.Engine.train(Engine.scala:176)
at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:251)
at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: com.google.protobuf.ServiceException: java.net.UnknownHostException: unknown host: hbase-master
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1678)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:42561)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(HConnectionManager.java:1682)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(HConnectionManager.java:1591)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$StubMaker.makeStub(HConnectionManager.java:1617)
... 36 more
Caused by: java.net.UnknownHostException: unknown host: hbase-master
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.<init>(RpcClient.java:385)
at org.apache.hadoop.hbase.ipc.RpcClient.createConnection(RpcClient.java:351)
at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1530)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
... 41 more
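The root cause in this trace is name resolution: the YARN container cannot resolve the host `hbase-master`. A quick sanity check (a sketch, not part of PIO) is to try resolving the hostname from the machine running the failing task:

```python
# Sanity-check sketch for the UnknownHostException above: verify that the
# machine running the failing YARN task can resolve the HBase master hostname.
import socket

def can_resolve(hostname):
    """Return True if the hostname resolves to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Run on each YARN node; if this returns False for "hbase-master", add the
# host to /etc/hosts or DNS, or use a fully qualified name in hbase-site.xml.
# can_resolve("hbase-master")
```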







From: Ambuj Sharma <am...@getamplify.com>
Sent: 23 May 2018 08:59
To: user@predictionio.apache.org
Cc: Wojciech Kowalski <wojci...@tomandco.co.uk>
Subject: Re: Problem with training in yarn cluster



Hi Wojciech,

I also faced many problems while setting up YARN with PredictionIO. This may
be a case where YARN is trying to find the pio.log file on the HDFS cluster.
You can try "--master yarn --deploy-mode client"; you need to pass this
configuration with pio train,

e.g., pio train -- --master yarn --deploy-mode client








Thanks and Regards

Ambuj Sharma

Sunrise may late, But Morning is sure.

Team ML

Betaout



On Wed, May 23, 2018 at 4:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Actually you might search the archives for “yarn” because I don’t recall
how the setup works off hand.



Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org



Also check the Spark Yarn requirements and remember that `pio train … --
various Spark params` allows you to pass arbitrary Spark params exactly as
you would to spark-submit on the pio command line. The double dash
separates PIO and Spark params.




From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Wojciech Kowalski <wojci...@tomandco.co.uk>
<wojci...@tomandco.co.uk>


Subject:  RE: Problem with training in yarn cluster



What is the command line for `pio train …`? Specifically, are you using

RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
Actually you might search the archives for “yarn” because I don’t recall
how the setup works off hand.

Archives here:
https://lists.apache.org/list.html?user@predictionio.apache.org

Also check the Spark Yarn requirements and remember that `pio train … --
various Spark params` allows you to pass arbitrary Spark params exactly as
you would to spark-submit on the pio command line. The double dash
separates PIO and Spark params.


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 4:07:38 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Wojciech Kowalski <wojci...@tomandco.co.uk>
<wojci...@tomandco.co.uk>
Subject:  RE: Problem with training in yarn cluster

What is the command line for `pio train …`? Specifically, are you using
yarn-cluster mode? This causes the driver code, which is a PIO process, to
be executed on an executor. Special setup is required for this.


From: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  RE: Problem with training in yarn cluster

Hello,



Actually I have another error in logs that is actually preventing train as
well:



[INFO] [RecommendationEngine$]



               _   _             __  __ _
     /\       | | (_)           |  \/  | |
    /  \   ___| |_ _  ___  _ __ | \  / | |
   / /\ \ / __| __| |/ _ \| '_ \| |\/| | |
  / ____ \ (__| |_| | (_) | | | | |  | | |____
 /_/    \_\___|\__|_|\___/|_| |_|_|  |_|______|







[INFO] [Engine] Extracting datasource params...

[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.

[INFO] [Engine] Datasource params:
(,DataSourceParams(shop_live,List(purchase, basket-add, wishlist-add,
view),None,None))

[INFO] [Engine] Extracting preparator params...

[INFO] [Engine] Preparator params: (,Empty)

[INFO] [Engine] Extracting serving params...

[INFO] [Engine] Serving params: (,Empty)

[INFO] [log] Logging initialized @6774ms

[INFO] [Server] jetty-9.2.z-SNAPSHOT

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@1798eb08{/jobs,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@47c4c3cd{/jobs/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@3e080dea{/jobs/job,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@c75847b{/jobs/job/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@5ce5ee56{/stages,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@3dde94ac{/stages/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@4347b9a0{/stages/stage,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@63b1bbef{/stages/stage/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@10556e91{/stages/pool,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@5967f3c3{/stages/pool/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2793dbf6{/storage,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@49936228{/storage/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@7289bc6d{/storage/rdd,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@1496b014{/storage/rdd/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2de3951b{/environment,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@7f3330ad{/environment/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@40e681f2{/executors,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@61519fea{/executors/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@502b9596{/executors/threadDump,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@367b7166{/executors/threadDump/json,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@42669f4a{/static,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@2f25f623{/,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@23ae4174{/api,null,AVAILABLE,@Spark}

[INFO] [ContextHandler] Started
o.s.j.s.ServletContextHandler@4e33e426{/jobs/job/kill,n

RE: Problem with training in yarn cluster

2018-05-22 Thread Pat Ferrel
What is the command line for `pio train …`? Specifically, are you using
yarn-cluster mode? This causes the driver code, which is a PIO process, to be
executed on an executor. Special setup is required for this.


From: Wojciech Kowalski 
Reply: user@predictionio.apache.org 
Date: May 22, 2018 at 2:28:43 PM
To: user@predictionio.apache.org 
Subject:  RE: Problem with training in yarn cluster  

Hello,

 

Actually I have another error in logs that is actually preventing train as well:

 

[INFO] [RecommendationEngine$]  
 
               _   _             __  __ _
     /\       | | (_)           |  \/  | |
    /  \   ___| |_ _  ___  _ __ | \  / | |
   / /\ \ / __| __| |/ _ \| '_ \| |\/| | |
  / ____ \ (__| |_| | (_) | | | | |  | | |____
 /_/    \_\___|\__|_|\___/|_| |_|_|  |_|______|
 
 
   
[INFO] [Engine] Extracting datasource params...
[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.
[INFO] [Engine] Datasource params: (,DataSourceParams(shop_live,List(purchase, 
basket-add, wishlist-add, view),None,None))
[INFO] [Engine] Extracting preparator params...
[INFO] [Engine] Preparator params: (,Empty)
[INFO] [Engine] Extracting serving params...
[INFO] [Engine] Serving params: (,Empty)
[INFO] [log] Logging initialized @6774ms
[INFO] [Server] jetty-9.2.z-SNAPSHOT
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1798eb08{/jobs,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@47c4c3cd{/jobs/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3e080dea{/jobs/job,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@c75847b{/jobs/job/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5ce5ee56{/stages,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@3dde94ac{/stages/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4347b9a0{/stages/stage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@63b1bbef{/stages/stage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@10556e91{/stages/pool,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@5967f3c3{/stages/pool/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2793dbf6{/storage,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@49936228{/storage/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7289bc6d{/storage/rdd,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@1496b014{/storage/rdd/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2de3951b{/environment,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@7f3330ad{/environment/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@40e681f2{/executors,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@61519fea{/executors/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@502b9596{/executors/threadDump,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@367b7166{/executors/threadDump/json,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@42669f4a{/static,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@2f25f623{/,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@23ae4174{/api,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@4e33e426{/jobs/job/kill,null,AVAILABLE,@Spark}
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@38d9ae65{/stages/stage/kill,null,AVAILABLE,@Spark}
[INFO] [ServerConnector] Started Spark@17239b3{HTTP/1.1}{0.0.0.0:47948}
[INFO] [Server] Started @7040ms
[INFO] [ContextHandler] Started 
o.s.j.s.ServletContextHandler@16cffbe4{/metrics/json,null,AVAILABLE,@Spark}
[WARN] [YarnSchedulerBackend$YarnSchedulerEndpoint] Attempted to request 
executors before the AM has registered!
[ERROR] [ApplicationMaster] Uncaught exception:  
 

Thanks,

Wojciech

 

From: Wojciech Kowalski
Sent: 22 May 2018 23:20
To: user@predictionio.apache.org
Subject: Problem with training in yarn cluster

 

Hello, I am trying to setup distributed cluster with separate all services but 
i have problem while running train:

 

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /pio/pio.log (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at 

Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
BTW The Universal Recommender has its own community support group here:
https://groups.google.com/forum/#!forum/actionml-user


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 10:07:25 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, Nasos Papageorgiou
<at.papageorg...@gmail.com> <at.papageorg...@gmail.com>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Yes but do you really care as a business about “users who viewed this also
viewed that”? I’d say no. You want to help them find what to buy and there
is a big difference between viewing and buying behavior. If you are only
interested in increasing time on site, or have ads shown that benefit from
more views then it might make more sense but a pure e-comm site would be
after sales.

The algorithm inside the UR can do all of these, but only 1 and 2 are
possible with the current implementation. The algorithm is called Correlated
Cross-Occurrence, and it can be targeted to recommend any recorded behavior.
On the theory that you would never want to throw away correlated behavior
when building models, all behavior is taken into account, so #1 could be
restated more precisely (but somewhat redundantly) as “people who viewed
(but then bought) this also viewed (and bought) these”. This targets what
you show people toward “important” views. In fact, if you are also using search
behavior and brand preferences it gets wordier: “people who viewed this
(and bought, searched for, and preferred brands in a similar way) also
viewed”. So you are showing viewed things favored by users similar to
the viewing user. You could use just one type of behavior, but why? Using all
of them makes the views more targeted.

So though it is possible to do 1-3 exactly as stated, you will get better
sales with the way described above.

Using my suggested method above #1 and #3 are the same.

   1. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]
   2. "eventNames”: [ “buy”,“view”, “search”, “brand-pref”]
   3. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]

If you want to do exactly as you have shown you’d have to throw out all
correlated cross-behavior.

   1. "eventNames”: [“view”]
   2. "eventNames”: [“buy”]
   3. "eventNames”: [“buy”, “view”] but then the internal model query would
   be only the current user’s view history. This is not supported in this
   exact form but could be added.

As you can see, you are discarding a lot of valuable data if you insist on a
very pure interpretation of your 1-3 definitions, and I can promise you
that most knowledgeable e-commerce sites do not mince words too finely.
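As a sketch of the difference between these configurations, the `eventNames` lists above would appear in the algorithms section of the UR's engine.json. The fragment below is illustrative and incomplete (a real engine.json also needs fields such as appName and indexName, and the algorithm name shown is a placeholder); the event names are the examples from this thread:

```python
# Illustrative engine.json "algorithms" fragments for the eventNames variants
# discussed above. The first event in the list is the primary (model) event.
import json

def algorithm_fragment(event_names):
    """Build a partial algorithms-section dict; "ur" is a placeholder name."""
    return {"algorithms": [{"name": "ur", "params": {"eventNames": event_names}}]}

# Cases 1/3: primary "view", with cross-occurrence from the other behaviors.
viewed_also_viewed = algorithm_fragment(["view", "buy", "search", "brand-pref"])
# Case 2: primary "buy".
bought_also_bought = algorithm_fragment(["buy", "view", "search", "brand-pref"])

print(json.dumps(viewed_also_viewed, indent=2))
```

Only the order of `eventNames` changes between the variants, which is why one train/deploy cycle per primary event is what the thread converges on.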


From: Nasos Papageorgiou <at.papageorg...@gmail.com>
<at.papageorg...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 12:39:27 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Just a correction:  File on the first bullet is engine.json (not
events.json).

2018-05-10 17:01 GMT+03:00 Nasos Papageorgiou <at.papageorg...@gmail.com>:

>
>
> Hi all,
> to elaborate on these cases, the purpose is to create a UR for the cases
> of:
>
> 1.   “User who Viewed this item also Viewed”
>
> 2.   “User who Bought this item also Bought”
>
> 3.   “User who Viewed this item also Bought ”
>
> while having Events of Buying and Viewing a product.
> I would like to make some questions:
>
> 1.   On Data source Parameters, file: events.json: the sequence in which
> the events are defined does not matter, right?
>
> 2.   If I specify one Event Type on the “eventNames” in Algorithm
> section (i.e. “view”)  and no event on the “blacklistEvents”,  is the
> second Event Type (i.e. “buy”) specified on the recommended list?
>
> 3.   If I use only "user" on the query, the "item case" will not be
> used for the recommendations. What is happening with the new users in
> that case?   Shall I use both "user" and "item" instead?
>
> 4.Values of less than 1 in “UserBias” and “ItemBias” on the query
> do not have any effect on the result.
>
> 5.    Is it feasible to build/train/deploy only once, and query for
> all 3 use cases?
>
>
> 6.   How can queries be directed to the different apps, given that there is
> no obvious way in the query parameters or the URL?
>
> Thank you.
>
>
>
> From: Pat Ferrel [mailto:p...@occamsmachete.com]
> Sent: Wednesday, May 09, 2018 4:41 PM
> To: us

Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-11 Thread Pat Ferrel
Yes but do you really care as a business about “users who viewed this also
viewed that”? I’d say no. You want to help them find what to buy and there
is a big difference between viewing and buying behavior. If you are only
interested in increasing time on site, or have ads shown that benefit from
more views then it might make more sense but a pure e-comm site would be
after sales.

The algorithm inside the UR can do all of these, but only 1 and 2 are
possible with the current implementation. The algorithm is called Correlated
Cross-Occurrence, and it can be targeted to recommend any recorded behavior.
On the theory that you would never want to throw away correlated behavior
when building models, all behavior is taken into account, so #1 could be
restated more precisely (but somewhat redundantly) as “people who viewed
(but then bought) this also viewed (and bought) these”. This targets what
you show people toward “important” views. In fact, if you are also using search
behavior and brand preferences it gets wordier: “people who viewed this
(and bought, searched for, and preferred brands in a similar way) also
viewed”. So you are showing viewed things favored by users similar to
the viewing user. You could use just one type of behavior, but why? Using all
of them makes the views more targeted.

So though it is possible to do 1-3 exactly as stated, you will get better
sales with the way described above.

Using my suggested method above #1 and #3 are the same.

   1. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]
   2. "eventNames”: [ “buy”,“view”, “search”, “brand-pref”]
   3. "eventNames”: [“view”, “buy”, “search”, “brand-pref”]

If you want to do exactly as you have shown you’d have to throw out all
correlated cross-behavior.

   1. "eventNames”: [“view”]
   2. "eventNames”: [“buy”]
   3. "eventNames”: [“buy”, “view”] but then the internal model query would
   be only the current user’s view history. This is not supported in this
   exact form but could be added.

As you can see, you are discarding a lot of valuable data if you insist on a
very pure interpretation of your 1-3 definitions, and I can promise you
that most knowledgeable e-commerce sites do not mince words too finely.


From: Nasos Papageorgiou <at.papageorg...@gmail.com>
<at.papageorg...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: May 11, 2018 at 12:39:27 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR: build/train/deploy once & querying for 3 use cases

Just a correction:  File on the first bullet is engine.json (not
events.json).

2018-05-10 17:01 GMT+03:00 Nasos Papageorgiou <at.papageorg...@gmail.com>:

>
>
> Hi all,
> to elaborate on these cases, the purpose is to create a UR for the cases
> of:
>
> 1.   “User who Viewed this item also Viewed”
>
> 2.   “User who Bought this item also Bought”
>
> 3.   “User who Viewed this item also Bought ”
>
> while having Events of Buying and Viewing a product.
> I would like to make some questions:
>
> 1.   On Data source Parameters, file: events.json: the sequence in which
> the events are defined does not matter, right?
>
> 2.   If I specify one Event Type on the “eventNames” in Algorithm
> section (i.e. “view”)  and no event on the “blacklistEvents”,  is the
> second Event Type (i.e. “buy”) specified on the recommended list?
>
> 3.   If I use only "user" on the query, the "item case" will not be
> used for the recommendations. What is happening with the new users in
> that case?   Shall I use both "user" and "item" instead?
>
> 4.Values of less than 1 in “UserBias” and “ItemBias” on the query
> do not have any effect on the result.
>
> 5.Is it feasible to build/train/deploy only once, and query for
> all 3 use cases?
>
>
> 6.   How can queries be directed to the different apps, given that there is
> no obvious way in the query parameters or the URL?
>
> Thank you.
>
>
>
> From: Pat Ferrel [mailto:p...@occamsmachete.com]
> Sent: Wednesday, May 09, 2018 4:41 PM
> To: user@predictionio.apache.org; gerasimos xydas
> Subject: Re: UR: build/train/deploy once & querying for 3 use cases
>
>
>
> Why do you want to throw away user behavior in making recommendations? The
> lift you get in purchases will be less.
>
>
>
> There is a use case for this when you are making recommendations basically
> inside a session where the user is browsing/viewing things on a hunt for
> something. In this case you would want to make recs using the user history
> of views but you have to build a model of purchase as the primary indicator
> or you won’t get purchase r

Re: UR evaluation

2018-05-10 Thread Pat Ferrel
Exactly, ranking is the only task of a recommender. Precision alone does not
measure ranking quality, but something like MAP@k does.
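A minimal MAP@k sketch makes concrete why it measures ranking while plain precision does not. This is an illustration only, not the implementation in the ur-analysis-tools repo mentioned in this thread:

```python
# Minimal MAP@k sketch: two rankings with the same precision@k can score very
# differently, because AP@k rewards putting relevant items near the top.
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision at each relevant hit, averaged over min(|relevant|, k)."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)  # precision at this cut-off position
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0

def map_at_k(all_recommended, all_relevant, k):
    """Mean of AP@k over all users."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0

# Same precision@3 (one hit each), different ranking quality:
# average_precision_at_k(["a", "x", "y"], {"a"}, 3)  -> 1.0
# average_precision_at_k(["x", "y", "a"], {"a"}, 3)  -> 0.333...
```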


From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
Date: May 10, 2018 at 10:09:22 PM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: UR evaluation

Very nice article. It makes the importance of treating recommendation as a
ranking task much clearer.
Thanks

Il gio 10 mag 2018, 19:12 Pat Ferrel <p...@occamsmachete.com> ha scritto:

> Here is a discussion of how we used it for tuning with multiple input
> types:
> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>
> We used video likes, dislikes, and video metadata to increase our MAP@k
> by 26% eventually. So this was mainly an exercise in incorporating data.
> Since this research was done we have learned how to better tune this type
> of situation but that’s a long story fit for another blog post.
>
>
> From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Date: May 10, 2018 at 9:54:23 AM
> To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
> Cc: user@predictionio.apache.org <user@predictionio.apache.org>
> <user@predictionio.apache.org>
> Subject:  Re: UR evaluation
>
> thank you very much, i didn't see this tool, i'll definitely try it.
> Clearly better to have such a specific instrument.
>
>
>
> 2018-05-10 18:36 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
>
>> You can if you want but we have external tools for the UR that are much
>> more flexible. The UR has tuning that can’t really be covered by the built
>> in API. https://github.com/actionml/ur-analysis-tools They do MAP@k as
>> well as creating a bunch of other metrics and comparing different types of
>> input data. They use a running UR to make queries against.
>>
>>
>> From: Marco Goldin <markomar...@gmail.com> <markomar...@gmail.com>
>> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Date: May 10, 2018 at 7:52:39 AM
>> To: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Subject:  UR evaluation
>>
>> hi all, i successfully trained a universal recommender but i don't know
>> how to evaluate the model.
>>
>> Is there a recommended way to do that?
>> I saw that predictionio-template-recommender actually has
>> the Evaluation.scala file which uses the class PrecisionAtK for the
>> metrics.
>> Should i use this template to implement a similar evaluation for the UR?
>>
>> thanks,
>> Marco Goldin
>> Horizons Unlimited s.r.l.
>>
>>
>


Re: UR evaluation

2018-05-10 Thread Pat Ferrel
Here is a discussion of how we used it for tuning with multiple input types: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

We used video likes, dislikes, and video metadata to increase our MAP@k by 26% 
eventually. So this was mainly an exercise in incorporating data. Since this 
research was done we have learned how to better tune this type of situation but 
that’s a long story fit for another blog post.
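For reference, the MAP@k metric mentioned above can be computed offline from held-out data. This is a minimal sketch of the metric itself, not the ur-analysis-tools implementation:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision is averaged over the ranks of the hits."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    if not relevant:
        return 0.0
    return score / min(len(relevant), k)

def map_at_k(all_recommended, all_relevant, k):
    """MAP@k: mean of AP@k over all users in the held-out set."""
    if not all_recommended:
        return 0.0
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)
```

Compare MAP@k on the same held-out set before and after adding a secondary indicator (likes, metadata, etc.) to measure the kind of lift described above.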


From: Marco Goldin <markomar...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: May 10, 2018 at 9:54:23 AM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: UR evaluation  

thank you very much, I didn't see this tool, I'll definitely try it. Clearly 
better to have such a specific instrument.



2018-05-10 18:36 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
You can if you want but we have external tools for the UR that are much more 
flexible. The UR has tuning that can’t really be covered by the built in API. 
https://github.com/actionml/ur-analysis-tools They do MAP@k as well as creating 
a bunch of other metrics and comparing different types of input data. They use 
a running UR to make queries against.


From: Marco Goldin <markomar...@gmail.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: May 10, 2018 at 7:52:39 AM
To: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  UR evaluation

hi all, I successfully trained a universal recommender but I don't know how to 
evaluate the model.

Is there a recommended way to do that?
I saw that predictionio-template-recommender actually has the Evaluation.scala 
file which uses the class PrecisionAtK for the metrics. 
Should i use this template to implement a similar evaluation for the UR?

thanks,
Marco Goldin
Horizons Unlimited s.r.l.




Re: UR evaluation

2018-05-10 Thread Pat Ferrel
You can if you want but we have external tools for the UR that are much
more flexible. The UR has tuning that can’t really be covered by the built
in API. https://github.com/actionml/ur-analysis-tools They do MAP@k as well
as creating a bunch of other metrics and comparing different types of input
data. They use a running UR to make queries against.


From: Marco Goldin  
Reply: user@predictionio.apache.org 

Date: May 10, 2018 at 7:52:39 AM
To: user@predictionio.apache.org 

Subject:  UR evaluation

hi all, I successfully trained a universal recommender but I don't know how
to evaluate the model.

Is there a recommended way to do that?
I saw that predictionio-template-recommender actually has
the Evaluation.scala file, which uses the class PrecisionAtK for the
metrics.
Should i use this template to implement a similar evaluation for the UR?

thanks,
Marco Goldin
Horizons Unlimited s.r.l.


Re: UR: build/train/deploy once & querying for 3 use cases

2018-05-09 Thread Pat Ferrel
Why do you want to throw away user behavior in making recommendations? The
lift you get in purchases will be less.

There is a use case for this when you are making recommendations basically
inside a session where the user is browsing/viewing things on a hunt for
something. In this case you would want to make recs using the user history
of views but you have to build a model of purchase as the primary indicator
or you won’t get purchase recommendations, and believe me, recommending views
is a road to bad results. People view many things they do not buy, so put
only the view behavior that leads to purchases in the model. Create a model
with purchase as the primary indicator and view as the secondary.

Once you have the model, use only the user’s session viewing history
as the Elasticsearch query.

This is a feature on our list.
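In that setup the application sends only the current session's view history with each query, rather than relying on the stored user history. A sketch of building such a query body follows; the "itemSet" field name is illustrative only, not the exact UR query schema, so check the UR query docs for your version:

```python
import json

def session_query(session_views, num=10):
    # session_views: item ids viewed in the current browsing session only.
    # "itemSet" is a hypothetical field name -- verify against the UR query
    # docs for your version. The model behind the query should have been
    # trained with purchase as the primary indicator and view as secondary.
    return json.dumps({"num": num, "itemSet": session_views})
```

The body would then be POSTed to the deployed PredictionServer's queries endpoint.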


From: gerasimos xydas 

Reply: user@predictionio.apache.org 

Date: May 9, 2018 at 6:20:46 AM
To: user@predictionio.apache.org 

Subject:  UR: build/train/deploy once & querying for 3 use cases

Hello everybody,

We are experimenting with the Universal Recommender to provide
recommendations for the 3 distinct use cases below:

- Get a product recommendation based on product views
- Get a product recommendation based on product purchases
- Get a product recommendation based on previous purchases and views (i.e.
users who viewed this bought that)

The event server is fed from a single app with two types of events: "view"
and "purchase".

1. How should we customize the query to fetch results for each separate
case?
2. Is it feasible to build/train/deploy only once, and query for all 3 use
cases?


Best Regards,
Gerasimos


Re: Info / resources for scaling PIO?

2018-04-24 Thread Pat Ferrel
PIO is based on the architecture of Spark, which uses HDFS. HBase also uses
HDFS. Scaling these is quite well documented on the web. Scaling PIO is
the same as scaling all its services. It is unlikely you’ll need it, but
you can also have more than one PIO server behind a load balancer.

Don’t use local models, put them in HDFS. Don’t mess with NFS, it is not
the design point for PIO. Scaling Spark beyond one machine will require
HDFS anyway so use it.

I also advise against using ES for all storage. Four things hit the event
storage: incoming events (input); training, where all events are read out
at high speed; optionally model storage (depending on the engine); and
queries. This will quickly overload one service, and ES is not built as an
object-retrieval DB. The only reason to use ES for all storage is that it
is convenient when doing development or experimenting with engines. In
production it would be risky to rely on ES for all storage, and you would
still need to scale out Spark and therefore HDFS.

There is a little written about various scaling models here:
http://actionml.com/docs/pio_by_actionml (see the architecture and workflow
tab), and there are a couple of system install docs that cover scaling.


From: Adam Drew  
Reply: user@predictionio.apache.org 

Date: April 24, 2018 at 7:37:35 AM
To: user@predictionio.apache.org 

Subject:  Info / resources for scaling PIO?

Hi all!



Is there any info on how to scale PIO to multiple nodes? I’ve gone through
a lot of the docs on the site and haven’t found anything. I’ve tested PIO
running with HBASE and ES for metadata and events, and with using just ES
for both (my preference thus far) and have my models on local storage. Would
scaling simply be a matter of deploying clustered ES, and then finding some
way to share my model storage, such as NFS or HDFS? The question then is
what (if anything) has to be done for the nodes to “know” about changes on
other nodes. For example, if the model gets trained on node A does node B
automatically know about that?



I hope that makes sense. I’m coming to PIO with no prior experience for the
underlying apache bits (spark, hbase / hdfs, etc) so there’s likely things
I’m not considering. Any help / docs / guidance is appreciated.



Thanks!

Adam


Re: pio deploy without spark context

2018-04-14 Thread Pat Ferrel
The need for Spark at query time depends on the engine. Which are you
using? The Universal Recommender, which I maintain, does not require Spark
for queries but uses PIO. We simply don’t use the Spark context so it is
ignored. To make PIO work you need to have the Spark code accessible but
that doesn’t mean there must be a Spark cluster, you can  set the Spark
master to “local” and there are no Spark resources used in the deployed pio
PredictionServer.

We have infra code to spin up a Spark cluster for training and bring it
back down afterward. This all works just fine. The UR PredictionServer also
has no need to be re-deployed since the model is hot-swapped after
training: deploy once, run forever. And there is no real requirement for
Spark to do queries.

So depending on the Engine the requirement for Spark is code level not
system level.


From: Donald Szeto  
Reply: user@predictionio.apache.org 

Date: April 13, 2018 at 4:48:15 PM
To: user@predictionio.apache.org 

Subject:  Re: pio deploy without spark context

Hi George,

This is unfortunately not possible now without modifying the source code,
but we are planning to refactor PredictionIO to be runtime-agnostic,
meaning the engine server would be independent and SparkContext would not
be created if not necessary.

We will start a discussion on the refactoring soon. You are very welcome to
add your input then, and any subsequent contribution would be highly
appreciated.

Regards,
Donald

On Fri, Apr 13, 2018 at 3:51 PM George Yarish 
wrote:

> Hi all,
>
> We use pio engine which doesn't require apache spark in serving time, but
> from my understanding anyway sparkContext will be created by "pio deploy"
> process by default.
> My question is there any way to deploy an engine avoiding creation of
> spark application if I don't need it?
>
> Thanks,
> George
>
>


Re: Hbase issue

2018-04-13 Thread Pat Ferrel
This may seem unhelpful now, but for others it might be useful to mention some 
minimum PIO-in-production best practices:

1) PIO should IMO never be run in production on a single node. When all 
services share the same memory, CPU, and disk, it is very difficult to find the 
root cause of a problem.
2) backup data with pio export periodically
3) install monitoring for disk used, as well as response times and other 
factors so you get warnings before you get wedged.
4) PIO will store data forever. It is designed as an input-only system; nothing 
is ever dropped. This is clearly unworkable in real life, so a feature was added 
to trim the event stream in a safe way in PIO 0.12.0. There is a separate 
Template for trimming the DB and doing other things like deduplication and 
other compression on a schedule that can, and should, be different from 
training. Do not use this template until you upgrade, and make sure it is 
compatible with your template: https://github.com/actionml/db-cleaner


From: bala vivek 
Reply: user@predictionio.apache.org 
Date: April 13, 2018 at 2:50:26 AM
To: user@predictionio.apache.org 
Subject:  Re: Hbase issue  

Hi Donald,

Yes, I'm running on a single machine. PIO, HBase, Elasticsearch, and Spark all 
run on the same server. Let me know which files I need to remove, because I 
have client data present in PIO. 

I have tried adding the entries in hbase-site.xml using the following link, 
after which I can see the Hmaster seems active but still, the error remains the 
same.

https://medium.com/@tjosepraveen/cant-get-connection-to-zookeeper-keepererrorcode-connectionloss-for-hbase-63746fbcdbe7


Hbase Error logs :- ( I have commented the server name)

2018-04-13 04:31:28,246 INFO  [RS:0;VD500042:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Opening socket connection to server 
localhost/0:0:0:0:0:0:0:1:2182. Will not attempt to authenticate using SASL 
(unknown error)
2018-04-13 04:31:28,247 WARN  [RS:0;XX:49584-SendThread(localhost:2182)] 
zookeeper.ClientCnxn: Session 0x162be5554b90003 for server null, unexpected 
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2018-04-13 04:31:28,553 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 20ms seconds
        at 
org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:225)
        at 
org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:449)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:225)
        at 
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2436)
(END)

I have tried pio-stop-all and pio-start-all multiple times, but no luck; the 
service is not up.
If I install HBase alone in the existing setup, let me know what things I 
should consider. If anyone has faced this issue, please provide the solution 
steps.

On Thu, Apr 12, 2018 at 9:13 PM, Donald Szeto  wrote:
Hi Bala,

Are you running a single-machine HBase setup? The ZooKeeper embedded in such a 
setup is pretty fragile to disk space issue and your ZNode might have corrupted.

If that’s indeed your setup, please take a look at HBase log files, 
specifically on messages from ZooKeeper. In this situation, one way to recover 
is to remove ZooKeeper files and let HBase recreate them, assuming from your 
log output that you don’t have other services depend on the same ZK.

Regards,
Donald

On Thu, Apr 12, 2018 at 5:34 AM bala vivek  wrote:
Hi,

I use PIO version 0.10.0 and HBase 1.2.4. The setup was working fine until this 
morning. I saw PIO was down because of a disk space issue on the server, and I 
cleared the unwanted files.

After doing a pio-stop-all and pio-start-all the HMaster service is not 
working. I tried multiple times the pio restart.

Whenever I do a pio-stop-all and check the services using jps, HMaster seems to 
be running. I also tried running the ./start-hbase.sh script, but pio status 
still does not show success.

pio error log :

[INFO] [Console$] Inspecting PredictionIO...
[INFO] [Console$] PredictionIO 0.10.0-incubating is installed at 
/opt/tools/PredictionIO-0.10.0-incubating
[INFO] [Console$] 

Re: how to set engine-variant in intellij idea

2018-04-10 Thread Pat Ferrel
There are instructions for using IntelliJ but, though I wrote the last
version, I apologize that I can’t make them work anymore. If you get them to
work, you would be doing the community a great service by telling us how, or
by editing the instructions.

http://predictionio.apache.org/resources/intellij/


From: qi zhang  
Reply: user@predictionio.apache.org 

Date: April 10, 2018 at 1:40:58 AM
To: user@predictionio.apache.org 

Subject:  how to set engine-variant in intellij idea

Hi all:
I ran into the following problem deploying a model with IntelliJ IDEA.

What is engine-variant? Where can I get the value for this parameter? Could you
help with an example of how to set it? Thank you! Thanks very much!




Re: Unclear problem with using S3 as a storage data source

2018-03-29 Thread Pat Ferrel
Ok, the problem, as I thought at first, is that Spark creates the model and the 
PredictionServer must read it.

My methods below still work. There is very little extra to creating a pseudo 
cluster for HDFS as far a performance if it is still running all on one machine.

You can also write it to localfs on the Spark/training machine and copy it to 
the PredictionServer before deploy. A simple scp in a script would do that.

Again I have no knowledge of using S3 for such things. If that works, someone 
else will have to help.




From: Dave Novelli <d...@ultravioletanalytics.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: March 29, 2018 at 6:19:58 AM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source  

Sorry Pat, I think I took some shortcuts in my initial explanation that are 
causing some confusion :) I'll try laying everything out again in detail...

I have configured 2 servers in AWS:

Event/Prediction Server - t2.medium
- Runs permanently
- Using swap to deal with 4GB mem limit (I know, I know)
- ElasticSearch
- HBase (pseudo-distributed mode, using normal files instead of hdfs)
- Web server for events and 6 prediction models

Training Server - r4.large
- Only spun up to execute "pio train" for the 6 UR models I've configured then 
spun back down
- Spark

My specific problem is that running "pio train" on the training server when 
"LOCALFS" is set as the model data store will deposit all the stub files in 
.pio_store/models/.

When I run "pio deploy" on the Event/Prediction Server, it's looking for those 
files in the .pio_store/models/ directory on the Event/Prediction server, and 
they're obviously not there. If I manually copy the files from the Training 
server to the Event/Prediction server then "pio deploy" works as expected.

My thought is that if the Training server saves those model stub files to S3, 
then the Event/Prediction server can read those files from S3 and I won't have 
to manually copy them.


Hopefully this clears my situation up!


As a note - I realize t2.medium is not a feasible instance type for any 
significant production system, but I'm bootstrapping a demo system on a very 
tight budget for a site that will almost certainly have extremely low traffic. 
In my initial tests I've managed to get UR working on this configuration and 
will be doing some simple load testing soon to see how far I can push it before 
it crashes. Speed is obviously not an issue at the moment but once it is (and 
once there's some funding) that t2 will be replaced with an r4 or an m5

Cheers,
Dave


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 7:40 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
Sorry then I don’t understand what part has no access to the file system on the 
single machine? 

Also, a t2 is not going to work with PIO. Spark 2 alone requires something like 
2g for a do-nothing empty executor and driver, so a real app will require 16g 
or so minimum (my laptop has 16g). Running the OS, HBase, ES, and Spark will get 
you to over 8g, then add data. Spark keeps all data needed at a given phase of 
the calculation in memory across the cluster; that’s where it gets its speed. 
Welcome to big-data :-)


From: Dave Novelli <d...@ultravioletanalytics.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
Date: March 28, 2018 at 3:47:35 PM
To: Pat Ferrel <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source

I don't *think* I need more spark nodes - I'm just using the one for training 
on an r4.large instance I spin up and down as needed.

I was hoping to avoid adding any additional computational load to my 
Event/Prediction/HBase/ES server (all running on a t2.medium) so I am looking 
for a way to *not* install HDFS on there as well. S3 seemed like it would be a 
super convenient way to pass the model files back and forth, but it sounds like 
it wasn't implemented as a data source for the model repository for UR.

Perhaps that's something I could implement and contribute? I can *kinda* read 
Scala haha, maybe this would be a fun learning project. Do you think it would 
be fairly straightforward?


Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com

On Wed, Mar 28, 2018 at 6:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node address 
even though all storage is on one machine. Then you use that version of HDFS to 
tell Spark where to look for the model.

Re: Unclear problem with using S3 as a storage data source

2018-03-28 Thread Pat Ferrel
So you need to have more Spark nodes and this is the problem?

If so, set up HBase on pseudo-clustered HDFS so you have a master node
address even though all storage is on one machine. Then you use that
version of HDFS to tell Spark where to look for the model. It gives the
model a URI.

I have never used the raw S3 support, HDFS can also be backed by S3 but you
use HDFS APIs, it is an HDFS config setting to use S3.

It is a rather unfortunate side effect of PIO but there are 2 ways to solve
this with no extra servers.

Maybe someone else knows how to use S3 natively for the model stub?


From: Dave Novelli <d...@ultravioletanalytics.com>
<d...@ultravioletanalytics.com>
Date: March 28, 2018 at 12:13:12 PM
To: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Cc: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Subject:  Re: Unclear problem with using S3 as a storage data source

Well, it looks like the local file system isn't an option in a multi-server
configuration without manually setting up a process to transfer those stub
model files.

I trained models on one heavy-weight temporary instance, and then when I
went to deploy from the prediction server instance it failed due to missing
files. I copied the .pio_store/models directory from the training server
over to the prediction server and then was able to deploy.

So, in a dual-instance configuration what's the best way to store the
files? I'm using pseudo-distributed HBase with standard file system storage
instead of HDFS (my current aim is keeping down cost and complexity for a
pilot project).

Is S3 back on the table as on option?

On Fri, Mar 23, 2018 at 11:03 AM, Dave Novelli <
d...@ultravioletanalytics.com> wrote:

> Ahhh ok, thanks Pat!
>
>
> Dave Novelli
> Founder/Principal Consultant, Ultraviolet Analytics
> www.ultravioletanalytics.com | 919.210.0948 |
> d...@ultravioletanalytics.com
>
> On Fri, Mar 23, 2018 at 8:08 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> There is no need to have Universal Recommender models put in S3, they are
>> not used and only exist (in stub form) because PIO requires them. The
>> actual model lives in Elasticsearch and uses special features of ES to
>> perform the last phase of the algorithm and so cannot be replaced.
>>
>> The stub PIO models have no data and will be tiny. putting them in HDFS
>> or the local file system is recommended.
>>
>>
>> From: Dave Novelli <d...@ultravioletanalytics.com>
>> <d...@ultravioletanalytics.com>
>> Reply: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Date: March 22, 2018 at 6:17:32 PM
>> To: user@predictionio.apache.org <user@predictionio.apache.org>
>> <user@predictionio.apache.org>
>> Subject:  Unclear problem with using S3 as a storage data source
>>
>> Hi all,
>>
>> I'm using the Universal Recommender template and I'm trying to switch
>> storage data sources from local file to S3 for the model repository. I've
>> read the page at https://predictionio.apache.org/system/anotherdatastore/
>> to try to understand the configuration requirements, but when I run pio
>> train it's indicating an error and nothing shows up in the s3 bucket:
>>
>> [ERROR] [S3Models] Failed to insert a model to
>> s3://pio-model/pio_modelAWJPjTYM0wNJe2iKBl0d
>>
>> I created a new bucket named "pio-model" and granted full public
>> permissions.
>>
>> Seemingly relevant settings from pio-env.sh:
>>
>> PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
>> PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=S3
>> ...
>>
>> PIO_STORAGE_SOURCES_S3_TYPE=s3
>> PIO_STORAGE_SOURCES_S3_REGION=us-west-2
>> PIO_STORAGE_SOURCES_S3_BUCKET_NAME=pio-model
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_ENDPOINT=http://s3.us-west-2.amazonaws.com
>>
>> # I've tried with and without this
>> #PIO_STORAGE_SOURCES_S3_BASE_PATH=pio-model
>>
>>
>> Any suggestions where I can start troubleshooting my configuration?
>>
>> Thanks,
>> Dave
>>
>>
>


--
Dave Novelli
Founder/Principal Consultant, Ultraviolet Analytics
www.ultravioletanalytics.com | 919.210.0948 | d...@ultravioletanalytics.com


Re: Error when training The Universal Recommender 0.7.0 with PredictionIO 0.12.0-incubating

2018-03-27 Thread Pat Ferrel
Pio build requires that ES hosts are known to Spark, which writes the model
to ES. You can pass these in on the `pio train` command line:

pio train … -- --conf spark.es.nodes=“node1,node2,node3”

notice no spaces in the quoted list of hosts, also notice the double dash,
which separates pio parameters from Spark parameters.

There is a way to pass this in using the sparkConf section in engine.json
but this is unreliable due to how the commas are treated in ES. The site
description for the UR in the small HA cluster has not been updated for
0.7.0 because we are expecting a Mahout release, which will greatly simplify
the build process described in the README.


From: VI, Tran Tan Phong  
Reply: user@predictionio.apache.org 

Date: March 27, 2018 at 3:09:30 AM
To: user@predictionio.apache.org 

Subject:  Error when training The Universal Recommender 0.7.0 with
PredictionIO 0.12.0-incubating

Hi,



I am trying to build and train UR 0.7.0 with PredictionIO 0.12.0-incubating
on a local “Small HA Cluster” (http://actionml.com/docs/small_ha_cluster)
using Elasticsearch 5.5.2.

By following the different steps of the how-to, I succeeded in executing the
“pio build” command of UR 0.7.0, but I am getting some errors on the following
“pio train” step.



Here are the principal errors:

…

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[INFO] [HttpMethodDirector] I/O exception (java.net.ConnectException)
caught when processing request: Connection refused (Connection refused)

[INFO] [HttpMethodDirector] Retrying request

[ERROR] [NetworkClient] Node [127.0.0.1:9200] failed (Connection refused
(Connection refused)); no other nodes left - aborting...

…



Exception in thread "main"
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES
version - typically this happens if the network/Elasticsearch cluster is
not accessible or when targeting a WAN/Cloud instance without the proper
setting 'es.nodes.wan.only'

…

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[127.0.0.1:9200]]



The cluster Elasticsearch (aml-elasticsearch) is up, but is not listening
on localhost.



Here under is my config of ES 5.5.2

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_CLUSTERNAME=aml-elasticsearch

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=aml-master,aml-slave-1,aml-slave-2

PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200,9200,9200

PIO_STORAGE_SOURCES_ELASTICSEARCH_SCHEMES=http

PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/usr/local/elasticsearch



Did somebody get this kind of error before? Any help or suggestion would be
appreciated.



Thanks,

VI Tran Tan Phong
This message contains information that may be privileged or confidential
and is the property of the Capgemini Group. It is intended only for the
person to whom it is addressed. If you are not the intended recipient, you
are not authorized to read, print, retain, copy, disseminate, distribute,
or use this message or any part thereof. If you receive this message in
error, please notify the sender immediately and delete all copies of this
message.


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
BTW I think you may have to push setting on the cli by adding “spark” to
the beginning of the key name:

pio train -- --conf spark.es.nodes="localhost" --driver-memory 8g
--executor-memory 8g


From: Pat Ferrel <p...@occamsmachete.com> <p...@occamsmachete.com>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: March 8, 2018 at 11:04:55 AM
To: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>,
user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>, u...@predictionio.incubator.apache.org
<u...@predictionio.incubator.apache.org>
<u...@predictionio.incubator.apache.org>
Subject:  Re: UR 0.7.0 - problem with training

es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes="localhost"

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.


From: Wojciech Kowalski <wojci...@tomandco.co.uk> <wojci...@tomandco.co.uk>
Reply: user@predictionio.apache.org <user@predictionio.apache.org>
<user@predictionio.apache.org>
Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org
<u...@predictionio.incubator.apache.org>
<u...@predictionio.incubator.apache.org>
Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0 but all
attempts are failing.



I cannot set „es.config” in the engine’s spark config section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don’t set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it’s running on a separate machine:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the CLI --conf es.nodes=elasticsearch doesn’t help either :/

pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g



Could anyone give any advice on what I am doing wrong?

I have separate docker containers for hadoop,hbase,elasticsearch,pio



Same setup was working fine on 0.10 and UR 0.5



Thanks,

Wojciech Kowalski


Re: UR 0.7.0 - problem with training

2018-03-08 Thread Pat Ferrel
es.nodes is supposed to be a string with hostnames separated by commas.
Depending on how your containers are set to communicate with the outside
world (Docker networking or port mapping) you may also need to set the
port, which is 9200 by default.

If your container is using port mapping and maps the container port 9200 to
the localhost port of 9200 you should be ok with only setting hostnames in
engine.json.

es.nodes="localhost"

But I suspect you didn’t set your container communication strategy because
this is the fallback that would have been tried with no valid setting.

If this is true look up how you set Docker to communicate, port mapping is
the simplest for a single all-in-one machine.


From: Wojciech Kowalski  
Reply: user@predictionio.apache.org 

Date: March 8, 2018 at 7:31:10 AM
To: u...@predictionio.incubator.apache.org


Subject:  UR 0.7.0 - problem with training

Hello, I am trying to set up the new UR 0.7.0 with PredictionIO 0.12.0 but all
attempts are failing.



I cannot set „es.config” in the engine’s spark config section, as I get this
error:

org.elasticsearch.index.mapper.MapperParsingException: object mapping for
[sparkConf.es.nodes] tried to parse field [es.nodes] as object, but found a
concrete value



If I don’t set this up, the engine fails to train because it cannot find
Elasticsearch on localhost, as it’s running on a separate machine:

Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException:
Connection error (check network and/or proxy settings)- all nodes failed;
tried [[localhost:9200]]



Passing es.nodes via the CLI with --conf es.nodes=elasticsearch doesn’t help either :/

*pio train -- --conf es.nodes=elasticsearch --driver-memory 8g
--executor-memory 8g*



Could anyone give any advice on what I am doing wrong?

I have separate Docker containers for Hadoop, HBase, Elasticsearch, and PIO



Same setup was working fine on 0.10 and UR 0.5



Thanks,

Wojciech Kowalski


Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
That would be fine since the model can contain anything. But the real question 
is where you want to use those params. If you need to use them the next time 
you train, you’ll have to persist them to a place read during training. That is 
usually only the metadata store (obviously input events too), which has the 
contents of engine.json. So to get them into the metadata store you may have to 
alter engine.json. 

Unless someone else knows how to alter the metadata directly after `pio train`

One problem is that you will never know what the new params are without putting 
them in a file or logging them. We keep them in a separate place and merge them 
with engine.json explicitly so we can see what is happening. They are 
calculated parameters, not hand made tunings. It seems important to me to keep 
those separate unless you are talking about some type of expected reinforcement 
learning, not really params but an evolving model.
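A minimal sketch of the "separate place, explicit merge" idea described above. The file names, template marker, and the `maxQueryEvents` parameter are all hypothetical; the point is that calculated tunings live outside the hand-edited file and are substituted in explicitly before `pio train`, so both remain visible.

```shell
# Hand-maintained template with a placeholder for the calculated value.
cat > engine.json.tpl <<'EOF'
{"algorithms": [{"name": "ur", "params": {"maxQueryEvents": @MAX_QUERY_EVENTS@}}]}
EOF

# Value produced by the external tuning run (a placeholder here).
MAX_QUERY_EVENTS=100

# Render the real engine.json from the template plus the tuned value.
sed "s/@MAX_QUERY_EVENTS@/$MAX_QUERY_EVENTS/" engine.json.tpl > engine.json
cat engine.json
```

Because the template never changes between tuning runs, a diff of the rendered engine.json shows exactly which calculated parameters moved.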
 

On Feb 12, 2018, at 2:48 PM, Tihomir Lolić <tihomir.lo...@gmail.com> wrote:

Thank you very much for the answer. I'll try with customizing workflow. There 
is a step where Seq of models is returned. My idea is to return model and model 
parameters in this step. I'll let you know if it works.

Thanks,
Tihomie

On Feb 12, 2018 23:34, "Pat Ferrel" <p...@occamsmachete.com> wrote:
This is an interesting question. As we make more mature full featured engines 
they will begin to employ hyper parameter search techniques or reinforcement 
params. This means that there is a new stage in the workflow or a feedback loop 
not already accounted for.

Short answer is no, unless you want to re-write your engine.json after every 
train and probably keep the old one for safety. You must re-train to get the 
new params put into the metastore and therefore available to your engine.

What we do for the Universal Recommender is have a special new workflow phase, 
call it a self-tuning phase, where we search for the right tuning of 
parameters. This is done with code that runs outside of pio and creates 
parameters that go into the engine.json. This can be done periodically to make 
sure the tuning is still optimal.

Not sure whether feedback or hyper parameter search is the best architecture 
for you.


From: Tihomir Lolić <tihomir.lo...@gmail.com>
Reply: user@predictionio.apache.org
Date: February 12, 2018 at 2:02:48 PM
To: user@predictionio.apache.org
Subject:  Dynamically change parameter list 

> Hi,
> 
> I am trying to figure out how to dynamically update algorithm parameter list. 
> After the train is finished only model is updated. The reason why I need this 
> data to be updated is that I am creating data mapping based on the training 
> data. Is there a way to update this data after the train is done?
> 
> Here is the code that I am using. The variable that should be updated 
> after the train is marked bold red.
> 
> import io.prediction.controller.{EmptyParams, EngineParams}
> import io.prediction.data.storage.EngineInstance
> import io.prediction.workflow.CreateWorkflow.WorkflowConfig
> import io.prediction.workflow._
> import org.apache.spark.ml.linalg.SparseVector
> import org.joda.time.DateTime
> import org.json4s.JsonAST._
> 
> import scala.collection.mutable
> 
> object TrainApp extends App {
> 
>   val envs = Map("FOO" -> "BAR")
> 
>   val sparkEnv = Map("spark.master" -> "local")
> 
>   val sparkConf = Map("spark.executor.extraClassPath" -> ".")
> 
>   val engineFactoryName = "LogisticRegressionEngine"
> 
>   val workflowConfig = WorkflowConfig(
> engineId = EngineConfig.engineId,
> engineVersion = EngineConfig.engineVersion,
> engineVariant = EngineConfig.engineVariantId,
> engineFactory = engineFactoryName
>   )
> 
>   val workflowParams = WorkflowParams(
> verbose = workflowConfig.verbosity,
> skipSanityCheck = workflowConfig.skipSanityCheck,
> stopAfterRead = workflowConfig.stopAfterRead,
> stopAfterPrepare = workflowConfig.stopAfterPrepare,
> sparkEnv = WorkflowParams().sparkEnv ++ sparkEnv
>   )
> 
>   WorkflowUtils.modifyLogging(workflowConfig.verbose)
> 
>   val dataSourceParams = DataSourceParams(sys.env.get("APP_NAME").get)
>   val preparatorParams = EmptyParams()
> 
>   val algorithmParamsList = Seq("Logistic" -> LogisticParams(columns = 
> Array[String](),
>

Re: Dynamically change parameter list

2018-02-12 Thread Pat Ferrel
This is an interesting question. As we make more mature full featured
engines they will begin to employ hyper parameter search techniques or
reinforcement params. This means that there is a new stage in the workflow
or a feedback loop not already accounted for.

Short answer is no, unless you want to re-write your engine.json after
every train and probably keep the old one for safety. You must re-train to
get the new params put into the metastore and therefore available to your
engine.

What we do for the Universal Recommender is have a special new workflow
phase, call it a self-tuning phase, where we search for the right tuning of
parameters. This is done with code that runs outside of pio and creates
parameters that go into the engine.json. This can be done periodically to
make sure the tuning is still optimal.

Not sure whether feedback or hyper parameter search is the best
architecture for you.


From: Tihomir Lolić  
Reply: user@predictionio.apache.org 

Date: February 12, 2018 at 2:02:48 PM
To: user@predictionio.apache.org 

Subject:  Dynamically change parameter list

Hi,

I am trying to figure out how to dynamically update the algorithm parameter
list. After the train is finished, only the model is updated. The reason why I
need this data to be updated is that I am creating a data mapping based on
the training data. Is there a way to update this data after the train is
done?

Here is the code that I am using. The variable that should be updated
after the train is marked *bold red.*

import io.prediction.controller.{EmptyParams, EngineParams}
import io.prediction.data.storage.EngineInstance
import io.prediction.workflow.CreateWorkflow.WorkflowConfig
import io.prediction.workflow._
import org.apache.spark.ml.linalg.SparseVector
import org.joda.time.DateTime
import org.json4s.JsonAST._

import scala.collection.mutable

object TrainApp extends App {

  val envs = Map("FOO" -> "BAR")

  val sparkEnv = Map("spark.master" -> "local")

  val sparkConf = Map("spark.executor.extraClassPath" -> ".")

  val engineFactoryName = "LogisticRegressionEngine"

  val workflowConfig = WorkflowConfig(
engineId = EngineConfig.engineId,
engineVersion = EngineConfig.engineVersion,
engineVariant = EngineConfig.engineVariantId,
engineFactory = engineFactoryName
  )

  val workflowParams = WorkflowParams(
verbose = workflowConfig.verbosity,
skipSanityCheck = workflowConfig.skipSanityCheck,
stopAfterRead = workflowConfig.stopAfterRead,
stopAfterPrepare = workflowConfig.stopAfterPrepare,
sparkEnv = WorkflowParams().sparkEnv ++ sparkEnv
  )

  WorkflowUtils.modifyLogging(workflowConfig.verbose)

  val dataSourceParams = DataSourceParams(sys.env.get("APP_NAME").get)
  val preparatorParams = EmptyParams()

  val algorithmParamsList = Seq("Logistic" -> LogisticParams(
    columns = Array[String](),
    dataMapping = Map[String, Map[String, SparseVector]]()))
  val servingParams = EmptyParams()

  val engineInstance = EngineInstance(
id = "",
status = "INIT",
startTime = DateTime.now,
endTime = DateTime.now,
engineId = workflowConfig.engineId,
engineVersion = workflowConfig.engineVersion,
engineVariant = workflowConfig.engineVariant,
engineFactory = workflowConfig.engineFactory,
batch = workflowConfig.batch,
env = envs,
sparkConf = sparkConf,
dataSourceParams =
JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> dataSourceParams),
preparatorParams =
JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> preparatorParams),
algorithmsParams =
JsonExtractor.paramsToJson(workflowConfig.jsonExtractor,
algorithmParamsList),
servingParams = JsonExtractor.paramToJson(workflowConfig.jsonExtractor,
workflowConfig.engineParamsKey -> servingParams)
  )

  val (engineLanguage, engineFactory) =
WorkflowUtils.getEngine(engineInstance.engineFactory,
getClass.getClassLoader)

  val engine = engineFactory()

  val engineParams = EngineParams(
dataSourceParams = dataSourceParams,
preparatorParams = preparatorParams,
algorithmParamsList = algorithmParamsList,
servingParams = servingParams
  )

  val engineInstanceId = CreateServer.engineInstances.insert(engineInstance)

  CoreWorkflow.runTrain(
env = envs,
params = workflowParams,
engine = engine,
engineParams = engineParams,
engineInstance = engineInstance.copy(id = engineInstanceId)
  )

  CreateServer.actorSystem.shutdown()
}


Thank you,
Tihomir


Re: pio train on Amazon EMR

2018-02-05 Thread Pat Ferrel
I agree, we looked at using EMR and found that we liked some custom Terraform + 
Docker much better. The existing EMR defined by AWS requires refactoring PIO or 
using it in yarn’s cluster mode. EMR is not meant to host any application code 
except what is sent into Spark in serialized form. However PIO expects to run 
the Spark “Driver” in the PIO process, which means on the PIO server machine. 

It is possible to make PIO use yarn’s cluster mode to serialize the “Driver” 
too but this is fairly complicated. I think I’ve seen Donald explain it before 
but we chose not to do this. For one thing optimizing and tuning yarn managed 
Spark changes the meaning of some tuning parameters.

Spark is moving to Kubernetes as a replacement for Yarn so we are quite 
interested in following that line of development.

One last thought on EMR: It was designed originally for Hadoop’s MapReduce. 
That meant that for a long time you couldn’t get big memory machines in EMR 
(you can now). So the EMR team in AWS does not seem to target Spark or other 
clustered services as well as they could. This is another reason we decided it 
wasn’t worth the trouble.


From: Mars Hall 
Reply: user@predictionio.apache.org 
Date: February 5, 2018 at 11:45:46 AM
To: user@predictionio.apache.org 
Subject:  Re: pio train on Amazon EMR  

Hi Malik,

This is a topic I've been investigating as well.

Given how EMR manages its clusters & their runtime, I don't think hacking 
configs to make the PredictionIO host act like a cluster member will be a 
simple or sustainable approach.

PredictionIO already operates Spark by building `spark-submit` commands.
  
https://github.com/apache/predictionio/blob/df406bf92463da4a79c8d84ec0ca439feaa0ec7f/tools/src/main/scala/org/apache/predictionio/tools/Runner.scala#L313

Implementing a new AWS EMR command runner in PredictionIO, so that we can 
switch `pio train` from the existing, plain `spark-submit` command to using the 
AWS CLI, `aws emr add-steps --steps Args=spark-submit` would likely solve a big 
part of this problem.
  https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
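A sketch of what such a runner might emit. The cluster id, S3 jar path, and step options are placeholders, not real resources; the command is only echoed here, not executed. `org.apache.predictionio.workflow.CreateWorkflow` is the main class a `pio train` ultimately submits.

```shell
# Build the add-steps invocation a hypothetical EMR-aware runner could
# produce in place of a plain spark-submit. All identifiers are placeholders.
STEP='Type=Spark,Name="pio train",ActionOnFailure=CONTINUE,Args=[--class,org.apache.predictionio.workflow.CreateWorkflow,s3://my-bucket/engine-assembly.jar]'

# Echo the command instead of running it, since real credentials and a live
# cluster would be required.
echo aws emr add-steps --cluster-id j-XXXXXXXX --steps "$STEP"
```

The `Args=[...]` list maps one-to-one onto spark-submit arguments, which is why reusing PredictionIO's existing command builder is attractive.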

Also, uploading the engine assembly JARs (the job code to run on Spark) to the 
cluster members or S3 for access from the EMR Spark runtime will be another 
part of this challenge.

On Mon, Feb 5, 2018 at 5:29 AM, Malik Twain  wrote:
I'm trying to run pio train with Amazon EMR. I copied core-site.xml and 
yarn-site.xml from EMR to my training machine, and configured HADOOP_CONF_DIR 
in pio-env.sh accordingly.

I'm running pio train as below:

pio train -- --master yarn --deploy-mode cluster

It's failing with the following errors:

18/02/05 11:56:15 INFO Client: 
   client token: N/A
   diagnostics: Application application_1517819705059_0007 failed 2 times due 
to AM Container for appattempt_1517819705059_0007_02 exited with exitCode: 1
Diagnostics: Exception from container-launch.

And below are the errors from EMR stdout and stderr respectively:

java.io.FileNotFoundException: /root/pio.log (Permission denied)
[ERROR] [CreateWorkflow$] Error reading from file: File 
file:/quickstartapp/MyExample/engine.json does not exist. Aborting workflow.

Thank you.



--
*Mars Hall
415-818-7039
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California

Re: Frequent Pattern Mining - No engine found. Your build might have failed. Aborting.

2018-02-01 Thread Pat Ferrel
This list is for support of ActionML products, not general PIO support. You can 
get that on the Apache PIO user mailing list, where I have forwarded this 
question.

Several uses of FPM are supported by the Universal Recommender, such as 
Shopping cart recommendations. That is a template we support.


From: dee...@infosoftni.com 
Date: February 1, 2018 at 2:51:01 AM
To: actionml-user 
Subject:  Frequent Pattern Mining - No engine found. Your build might have 
failed. Aborting.  

I am using the Frequent Pattern Mining template and got the following error: "No 
engine found." 

Please advise. 


s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$ pio build 
--verbose
[INFO] [Engine$] Using command 
'/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt'
 at /home/s5/Documents/DataSheet/Templates/pio-template-fpm to build.
[INFO] [Engine$] If the path above is incorrect, this process will fail.
[INFO] [Engine$] Uber JAR disabled. Making sure 
lib/pio-assembly-0.12.0-incubating.jar is absent.
[INFO] [Engine$] Going to run: 
/home/s5/Documents/DataSheet/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/sbt/sbt
  package assemblyPackageDependency in 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm
[INFO] [Engine$] [info] Loading project definition from 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/project
[INFO] [Engine$] [info] Set current project to pio-template-text-clustering (in 
build file:/home/s5/Documents/DataSheet/Templates/pio-template-fpm/)
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:41 PM
[INFO] [Engine$] [info] Including from cache: scala-library.jar
[INFO] [Engine$] [info] Checking every *.class/*.jar file's SHA-1.
[INFO] [Engine$] [info] Merging files...
[INFO] [Engine$] [warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[INFO] [Engine$] [warn] Strategy 'discard' was applied to a file
[INFO] [Engine$] [info] Assembly up to date: 
/home/s5/Documents/DataSheet/Templates/pio-template-fpm/target/scala-2.10/pio-template-text-clustering-assembly-0.1-SNAPSHOT-deps.jar
[INFO] [Engine$] [success] Total time: 1 s, completed 1 Feb, 2018 4:13:42 PM
[INFO] [Engine$] Compilation finished successfully.
[INFO] [Engine$] Looking for an engine...
[ERROR] [Engine$] No engine found. Your build might have failed. Aborting.
s5@AMOL-PATIL:~/Documents/DataSheet/Templates/pio-template-fpm$


--
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com.
To post to this group, send email to actionml-u...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/f193dd54-85a7-4598-88fe-fb7c74644f11%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: PIO error

2018-01-23 Thread Pat Ferrel
Unfortunately I can’t possibly guess without more information.

What do the logs say when pio cannot be started? Are all these pio
instances separate, not in a cluster? In other words does each pio server
have all necessary services running on them? I assume none is sleeping like
a laptop does?

If you are worried: when properly configured, PIO is quite stable on servers
that do not sleep. I have never seen a bug that would cause this, and I have
installed it hundreds of times, so let’s look through the logs and check your
pio-env.sh on a particular machine that is having this problem.


From: bala vivek  
Date: January 22, 2018 at 11:32:17 PM
To: actionml-user 

Subject:  Re: PIO error

Hi Pat,

The PIO has installed on the Ubuntu server, the Dev server and
production servers are hosted in other countries and we are connecting
through VPN from my laptop.
And yes, doing a pio-stop-all and pio-start-all always resolves the issue,
but it keeps recurring, and sometimes the PIO service does not come up even
after multiple PIO restarts.

I am not sure of the core reason why the service often goes down.

Regards,
Bala

On Tuesday, January 23, 2018 at 2:47:26 AM UTC+5:30, pat wrote:
>
> If you are using a laptop for a dev machine, when it sleeps it can
> interfere with Zookeeper, which is started and used by HBase. So
> pio-stop-all then pio-start-all restarts HBase and therefore Zookeeper
> gracefully to solve this.
>
> Does the stop/start always solve this?
>
>
>
> From: bala vivek 
> Date: January 21, 2018 at 10:39:31 PM
> To: actionml-user 
> Subject:  PIO error
>
> Hi,
>
> I'm getting the following error in pio.
>
> pio status gives me the below result,
>
> [INFO] [Console$] Inspecting PredictionIO...
> [INFO] [Console$] PredictionIO 0.10.0-incubating is installed at
> /opt/tools/PredictionIO-0.10.0-incubating
> [INFO] [Console$] Inspecting Apache Spark...
> [INFO] [Console$] Apache Spark is installed at
> /opt/tools/PredictionIO-0.10.0-incubating/vendors/spark-1.6.3-bin-hadoop2.6
> [INFO] [Console$] Apache Spark 1.6.3 detected (meets minimum requirement
> of 1.3.0)
> [INFO] [Console$] Inspecting storage backend connections...
> [INFO] [Storage$] Verifying Meta Data Backend (Source: ELASTICSEARCH)...
> [INFO] [Storage$] Verifying Model Data Backend (Source: LOCALFS)...
> [INFO] [Storage$] Verifying Event Data Backend (Source: HBASE)...
> [INFO] [Storage$] Test writing to Event Store (App Id 0)...
> [ERROR] [Console$] Unable to connect to all storage backends successfully.
> The following shows the error message from the storage backend.
> [ERROR] [Console$] Failed after attempts=1, exceptions:
> Mon Jan 22 01:00:02 EST 2018, org.apache.hadoop.hbase.
> client.RpcRetryingCaller@5c5d6175, org.apache.hadoop.hbase.ipc.
> RemoteWithExtrasException(org.apache.hadoop.hbase.PleaseHoldException):
> org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>at org.apache.hadoop.hbase.master.HMaster.
> checkInitialized(HMaster.java:2293)
>at org.apache.hadoop.hbase.master.HMaster.checkNamespaceManagerReady(
> HMaster.java:2298)
>at org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(
> HMaster.java:2536)
>at org.apache.hadoop.hbase.master.MasterRpcServices.
> listNamespaceDescriptors(MasterRpcServices.java:1100)
>at org.apache.hadoop.hbase.protobuf.generated.
> MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:55734)
>at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2180)
>at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
>at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
> RpcExecutor.java:133)
>at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
>at java.lang.Thread.run(Thread.java:748)
>
>  (org.apache.hadoop.hbase.client.RetriesExhaustedException)
> [ERROR] [Console$] Dumping configuration of initialized storage backend
> sources. Please make sure they are correct.
> [ERROR] [Console$] Source Name: ELASTICSEARCH; Type: elasticsearch;
> Configuration: TYPE -> elasticsearch, HOME -> /opt/tools/PredictionIO-0.10.
> 0-incubating/vendors/elasticsearch-1.7.3
> [ERROR] [Console$] Source Name: LOCALFS; Type: localfs; Configuration:
> PATH -> /root/.pio_store/models, TYPE -> localfs
> [ERROR] [Console$] Source Name: HBASE; Type: hbase; Configuration: TYPE ->
> hbase, HOME -> /opt/tools/PredictionIO-0.10.0-incubating/vendors/hbase-1.
> 2.4
>
>
> This setup is running in our production and this is not a new setup. Often
> I get this error and if do a pio-stop-all and pio-start-all, pio will work
> fine.
> But why often the pio status is showing error. There was no new
> configuration changes made in the pio-env.sh file

Re: Prediction IO install failed in Linux

2018-01-23 Thread Pat Ferrel
This would be very difficult to do. Even if you used a machine connected to
the internet to download things like pio, spark, etc. the very build tools
used (sbt) expect to be able to get code from various repositories on the
internet. To build templates would further complicate this since each
template may have different needs.

Perhaps you can take a laptop home, install and build, take it back to work
with all needed code installed. In order to use open source software it is
virtually impossible to work without access to the internet.
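One hedged workaround along the "build elsewhere, carry it in" line suggested above: run `pio build` once on an internet-connected machine so the sbt/ivy dependency caches are populated, then carry those caches to the offline host. The paths below assume the default cache locations; adapt them to your setup.

```shell
# After a successful online `pio build`, these directories hold every jar
# sbt fetched. (mkdir -p only guards against a partially populated setup.)
mkdir -p "$HOME/.ivy2" "$HOME/.sbt"

# Bundle the caches for transfer to the firewalled machine.
tar czf /tmp/pio-caches.tar.gz -C "$HOME" .ivy2 .sbt
ls -l /tmp/pio-caches.tar.gz
```

On the offline host, extract the archive into the home directory before building; sbt will resolve from the local cache instead of the network. Each template may still pull extra dependencies, so build every template you need while still online.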


From: Praveen Prasannakumar 

Reply: user@predictionio.apache.org 

Date: January 23, 2018 at 7:03:27 AM
To: user@predictionio.apache.org 

Subject:  Re: Prediction IO install failed in Linux

Team - Is there a way to install PredictionIO offline? If yes, can
someone provide some documents for it?

Thanks
Praveen

On Fri, Jan 19, 2018 at 11:05 AM, Praveen Prasannakumar <
praveen2399wo...@gmail.com> wrote:

> Hello Team
>
> I am trying to install PredictionIO on one of our Linux boxes within our
> office network. My company network has a firewall and sometimes it won't
> connect to outside servers. I am not sure whether that is the reason for the
> failure while executing the make-distribution.sh script. Can you please help
> me figure out how I can install PredictionIO within my office network?
>
> Attaching the screenshot with error.
>
> ​
>
> Thanks
> Praveen
>




Re: Need Help Setting up prediction IO

2018-01-17 Thread Pat Ferrel
PIO uses Postgres, MySQL, or another JDBC database on the SQL side, or (and I 
always use this) HBase. HBase is a high-performance NoSQL DB that scales 
indefinitely.

It is possible to use any DB if you write an EventStore class for it, wrapping 
the DB calls with a virtualization API that is DB independent.

Memory is completely algorithm- and data-dependent, but expect PIO, which uses 
Spark, which in turn gets its speed from keeping data in memory, to use a lot 
compared to a web server. PIO apps are often in the big-data category and many 
deployments require Spark clusters with many GB per machine. It is rare to be 
able to run PIO in production on a single machine.

Welcome to big data.


On Jan 11, 2018, at 6:23 PM, Rajesh Jangid <raje...@grazitti.com> wrote:

Hi, 
Well, with PIO 0.10 I think some dependency is causing trouble on Linux. We 
have figured out a way using PIO for now, and everything is working 
great. 
  Thanks for the support though. 

A few questions:
1. Does the latest PIO support MongoDB or other NoSQL stores?
2. Memory use by PIO: is there a max memory limit set, and if need be, can it be 
set? 


Thanks
Rajesh 


On Jan 11, 2018 10:25 PM, "Pat Ferrel" <p...@occamsmachete.com> wrote:
The version in the artifact suffix built by Scala has only the major.minor 
version, so _2.10 or _2.11. PIO 0.10.0 needs 2.10. Where, and in what variable, 
did you set 2.10.4? That is the problem. There will never be a lib built for 
_2.10.4; it will always be _2.10.
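A build.sbt sketch of the fix (the shapeless version number is a placeholder): the `%%` operator derives the `_2.10` artifact suffix from `scalaVersion` automatically, so mixed `_2.10` / `_2.10.4` suffixes cannot appear as long as no dependency hard-codes a full-version suffix.

```scala
// build.sbt (sketch): pin the Scala line once and let sbt derive suffixes.
scalaVersion := "2.10.6"

// %% appends _2.10 automatically; never hand-write a _2.10.4 artifact name
// with % "shapeless_2.10.4".
libraryDependencies += "com.chuusai" %% "shapeless" % "2.0.0"
```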



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com> wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy <danieljamesda...@gmail.com> wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid <raje...@grazitti.com> wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid <raje...@grazitti.com> wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step i

The Universal Recommender v0.7.0

2018-01-17 Thread Pat Ferrel
We had been waiting to release UR v0.7.0 pending testing (done) and the release 
of Mahout v0.13.1 (not done). Today we have released UR v0.7.0 anyway. It comes 
with:
- Support for PIO v0.12.0
- Requires Scala 2.11 (can be converted to use Scala 2.10, but it’s a manual 
process)
- Requires Elasticsearch 5.x, and uses the REST client exclusively. This enables 
Elasticsearch authentication if needed.
- Speed improvements for queries (ES 5.x is faster) and model building (a 
snapshot build of Mahout includes speedups)
- Requires a source build of Mahout from a version forked by ActionML. This 
requirement will be removed as soon as Mahout releases v0.13.1, which will be 
incorporated in UR v0.7.1 ASAP. Follow the special build instructions in the UR’s 
README.md.
- Fixes a bug in the business rules for excluding items with certain properties

Report issues on the GitHub repo here: 
https://github.com/actionml/universal-recommender
Get tag v0.7.0 for `pio build`, and be sure to read the instructions and 
warnings in the README.md there.

Ask questions on the Google Group here: 
https://groups.google.com/forum/#!forum/actionml-user
or on the PIO user list.

Re: Need Help Setting up prediction IO

2018-01-11 Thread Pat Ferrel
The version in the artifact suffix built by Scala has only the major.minor 
version, so _2.10 or _2.11. PIO 0.10.0 needs 2.10. Where, and in what variable, 
did you set 2.10.4? That is the problem. There will never be a lib built for 
_2.10.4; it will always be _2.10.



On Jan 11, 2018, at 5:15 AM, Daniel O' Shaughnessy  
wrote:

Basically you need to make sure all your lib dependencies in build.sbt work 
together. 

On Thu, 11 Jan 2018 at 13:14 Daniel O' Shaughnessy > wrote:
Maybe try v2.10.4 based on this line:

[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4

I'm unfamiliar with the ubuntu setup for pio so can't help you there I'm afraid.

On Thu, 11 Jan 2018 at 05:08 Rajesh Jangid > wrote:
I am trying to run this on ubuntu 16.04

On Thu, Jan 11, 2018 at 10:36 AM, Rajesh Jangid > wrote:
Hi, 
  I have tried once again with 2.10 as well but getting following dependency 
error

[INFO] [Console$] [error] Modules were resolved with conflicting cross-version 
suffixes in 
{file:/home/integration/client/PredictionIO-0.10/Engines/MyRecommendation/}myrecommendation:
[INFO] [Console$] [error]com.chuusai:shapeless _2.10, _2.10.4
[INFO] [Console$] java.lang.RuntimeException: Conflicting cross-version 
suffixes in: com.chuusai:shapeless
[INFO] [Console$] at scala.sys.package$.error(package.scala:27)
[INFO] [Console$] at 
sbt.ConflictWarning$.processCrossVersioned(ConflictWarning.scala:46)
[INFO] [Console$] at sbt.ConflictWarning$.apply(ConflictWarning.scala:32)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1300)
[INFO] [Console$] at sbt.Classpaths$$anonfun$100.apply(Defaults.scala:1297)
[INFO] [Console$] at 
scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
[INFO] [Console$] at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
[INFO] [Console$] at sbt.std.Transform$$anon$4.work(System.scala:63)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
[INFO] [Console$] at sbt.Execute.work(Execute.scala:237)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
[INFO] [Console$] at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
[INFO] [Console$] at 
sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[INFO] [Console$] at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[INFO] [Console$] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[INFO] [Console$] at java.lang.Thread.run(Thread.java:745)
[INFO] [Console$] [error] (*:update) Conflicting cross-version suffixes in: 
com.chuusai:shapeless
[INFO] [Console$] [error] Total time: 6 s, completed Jan 11, 2018 5:03:51 AM
[ERROR] [Console$] Return code of previous step is 1. Aborting.


On Wed, Jan 10, 2018 at 10:03 PM, Daniel O' Shaughnessy 
> wrote:
I've pulled down this version without any modifications and ran it with pio 
v0.10 on a Mac, and it builds with no issues.

However, when I add in scalaVersion := "2.11.8" to build.sbt I get a dependency 
error.

pio v0.10 supports scala 2.10 so you need to switch to this to run! 
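The "conflicting cross-version suffixes" error above means artifacts built for Scala 2.10 and 2.11 ended up on the same classpath. A build.sbt sketch for pinning a template to Scala 2.10 for pio v0.10 (the artifact names and versions are assumptions to adapt, not the template's actual file):

```scala
// Pin the template to the Scala line pio v0.10 was built for; do NOT set 2.11.x.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // %% appends the _2.10 suffix matching scalaVersion, avoiding the
  // shapeless cross-version conflict shown in the log above.
  "org.apache.predictionio" %% "apache-predictionio-core" % "0.10.0-incubating" % "provided",
  "org.apache.spark"        %% "spark-core"               % "1.6.3" % "provided",
  "org.apache.spark"        %% "spark-mllib"              % "1.6.3" % "provided"
)
```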

On Wed, 10 Jan 2018 at 13:47 Rajesh Jangid > wrote:
Yes, v0.5.0

On Jan 10, 2018 7:07 PM, "Daniel O' Shaughnessy" > wrote:
Is this the template you're using? 

https://github.com/apache/predictionio-template-ecom-recommender 


On Wed, 10 Jan 2018 at 13:16 Rajesh Jangid > wrote:
Yes, 
We have dependency with elastic and we have elastic 1.4.4 already running. 
We Do not want to run another elastic instance.
Latest prediction IO does not support elastic 1.4.4


On Wed, Jan 10, 2018 at 6:25 PM, Daniel O' Shaughnessy 
> wrote:
Strange… do you absolutely need to run this with pio v0.10? 

On Wed, 10 Jan 2018 at 12:50 Rajesh Jangid 

Re: Using Dataframe API vs. RDD API?

2018-01-05 Thread Pat Ferrel
Yes and I do not recommend that because the EventServer schema is not a 
developer contract. It may change at any time. Use the conversion method and go 
through the PIO API to get the RDD then convert to DF for now.

I’m not sure what PIO uses to get an RDD from Postgres but if they do not use 
something like the lib you mention, a PR would be nice. Also if you have an 
interest in adding the DF APIs to the EventServer contributions are encouraged. 
Committers will give some guidance, I’m sure—ones who know more than me on the subject.

If you want to donate some DF code, create a Jira and we’ll easily find a 
mentor to make suggestions. There are many benefits to this including not 
having to support a fork of PIO through subsequent versions. Also others are 
interested in this too.

 

On Jan 5, 2018, at 7:39 AM, Daniel O' Shaughnessy <danieljamesda...@gmail.com> 
wrote:

Should have mentioned that I used org.apache.spark.rdd.JdbcRDD to read in 
the RDD from a postgres DB initially.

This way you don't need to use an EventServer!
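For reference, a sketch of that approach (the connection string, table, and column names are made up; it assumes the Postgres JDBC driver is on the classpath). Note that JdbcRDD requires exactly two '?' placeholders in the query for its partition bounds:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{JdbcRDD, RDD}

// Read rows straight out of Postgres as an RDD, bypassing the EventServer.
def loadEvents(sc: SparkContext): RDD[(Long, String, Double)] =
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection("jdbc:postgresql://localhost/mydb", "user", "pass"),
    // '?' placeholders are filled with each partition's id range
    "SELECT id, label, value FROM events WHERE ? <= id AND id <= ?",
    lowerBound = 1L, upperBound = 1000000L, numPartitions = 8,
    (rs: ResultSet) => (rs.getLong(1), rs.getString(2), rs.getDouble(3))
  )
```

As Pat cautions above, this bypasses the PIO API, so treat it as a stopgap rather than a supported integration.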

On Fri, 5 Jan 2018 at 15:37 Daniel O' Shaughnessy <danieljamesda...@gmail.com 
<mailto:danieljamesda...@gmail.com>> wrote:
Hi Shane, 

I've successfully used : 

import org.apache.spark.ml.classification.{ RandomForestClassificationModel, 
RandomForestClassifier }

with pio. You can access feature importance through the RandomForestClassifier 
also.

Very simple to convert RDDs to DFs as Pat mentioned, something like:

val RDD_2_DF = sqlContext.createDataFrame(yourRDD).toDF("col1", "col2")
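Putting the two points together—a minimal sketch assuming Spark 2.x (on Spark 1.x the vector type lives in org.apache.spark.mllib.linalg instead). All names and data here are illustrative:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Convert an RDD of (label, features) pairs to a DataFrame for the ml API.
val training = spark.sparkContext
  .parallelize(Seq((1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0))))
  .toDF("label", "features")

// Fit the DataFrame-based random forest and read per-feature importances --
// the capability that motivated moving off the RDD-based mllib API.
val model = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .fit(training)

println(model.featureImportances) // a Vector with one weight per feature
```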



On Thu, 4 Jan 2018 at 23:10 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
 
<https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that requests Spark ML support, which assumes DFs.


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com 
<mailto:shanewaldenjohn...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350 <tel:(801)%20360-3350>
LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>




Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Actually there are libs that will read DFs from HBase 
https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html
 
<https://svn.apache.org/repos/asf/hbase/hbase.apache.org/trunk/_chapters/spark.html>

This is out of band with PIO and should not be used IMO because the schema of 
the EventStore is not guaranteed to remain as-is. The safest way is to 
translate or get DFs integrated into PIO. I think there is an existing Jira that requests Spark ML support, which assumes DFs.


On Jan 4, 2018, at 12:25 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson <shanewaldenjohn...@gmail.com 
<mailto:shanewaldenjohn...@gmail.com>> wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350

LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook 
<https://www.facebook.com/shane.johnson.71653>



Re: Using Dataframe API vs. RDD API?

2018-01-04 Thread Pat Ferrel
Funny you should ask this. Yes, we are working on a DF based Universal 
Recommender but you have to convert the RDD into a DF since PIO does not read 
out data in the form of a DF (yet). This is a fairly simple step of maybe one 
line of code but would be better supported in PIO itself. The issue is that the 
EventStore uses libs that may not read out DFs, but RDDs. This is certainly the 
case with Elasticsearch, which provides an RDD lib. I haven’t seen one from them that reads out DFs, though it would make a lot of sense for ES especially.

So TLDR; yes, just convert the RDD into a DF for now.

Also please add a feature request as a PIO Jira ticket to look into this. I for 
one would +1


On Jan 4, 2018, at 11:55 AM, Shane Johnson  wrote:

Hello group, Happy new year! Does anyone have a working example or template 
using the DataFrame API vs. the RDD based APIs. We are wanting to migrate to 
using the new DataFrame APIs to take advantage of the Feature Importance 
function for our Regression Random Forest Models.

We are wanting to move from 

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils
to
import org.apache.spark.ml.regression.{RandomForestRegressionModel, 
RandomForestRegressor}

Is this something that should be fairly straightforward by adjusting parameters 
and calling new classes within DASE or is it much more involved development.

Thank You!
Shane Johnson | 801.360.3350

LinkedIn  | Facebook 



Re: Error: "unable to undeploy"

2018-01-03 Thread Pat Ferrel
The UR does not require more than one deploy (assuming the server runs 
forever). Retraining the UR automatically re-deploys the new model. 

All other Engines afaik do require retrain-redeploy.

Users should be aware that PIO is a framework that provides no ML function 
whatsoever. It supports a workflow but Engines are free to simplify or use it 
in different ways so always preface a question with what Engine you are using 
or asking about.



On Jan 3, 2018, at 4:33 AM, Noelia Osés Fernández  wrote:

Hi lokotochek,

You mentioned that it wasn't necessary to redeploy after retraining. However, 
today I have come across a PIO webpage that I hadn't seen before that tells me to redeploy after retraining (section 'Update Model with New Data'):

http://predictionio.incubator.apache.org/deploy/ 


Particularly, this page suggests adding the following line to the crontab to 
retrain every day:

0 0 * * *   $PIO_HOME/bin/pio train; $PIO_HOME/bin/pio deploy


Here it is clear that it is redeploying after retraining. So does it not 
actually hot-swap the model? Or the UR does but this page is more general for 
other templates 
that might not do that?

Thanks for your help!



On 14 December 2017 at 15:57, Александр Лактионов > wrote:
Hi Noelia,
you don't have to redeploy your app after training. The model will be hot-swapped, and the previous process (run by pio deploy) will serve the new recommendations automatically
> On Dec 14, 2017, at 17:56, Noelia Osés Fernández wrote:
> 
> Hi,
> 
> The first time after reboot that I train and deploy my PIO app everything 
> works well. However, if I then retrain and deploy again, I get the following 
> error: 
> 
> [INFO] [MasterActor] Undeploying any existing engine instance at http://0.0.0.0:8000
> [ERROR] [MasterActor] Another process might be occupying 0.0.0.0:8000. Unable to undeploy.
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Retrying... (2 more trial(s))
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [ERROR] [MasterActor] Bind failed. Retrying... (1 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Retrying... (0 more trial(s))
> [ERROR] [TcpListener] Bind failed for TCP channel on endpoint [/0.0.0.0:8000]
> [WARN] [HttpListener] Bind to /0.0.0.0:8000 failed
> [ERROR] [MasterActor] Bind failed. Shutting down.
> 
> I thought it was possible to retrain an app that was running and then deploy 
> again.
> Is this not possible?
> 
> How can I kill the running instance?
> I've tried the trick in handmade's integration test but it doesn't work:
> 
> deploy_pid=`jps -lm | grep "onsole deploy" | cut -f 1 -d ' '`
> echo "Killing the deployed test PredictionServer"
> kill "$deploy_pid"
> 
> I still get the same error after doing this.
> 
> Any help is much appreciated.
> Best regards,
> Noelia
> 
> 
> 
> 
> 
> 
> 





Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
BTW there is a new Chrome extension that lets you browse ES and create any JSON 
query. Just found it myself after Sense stopped working in Chrome. Try 
ElasticSearch Head, found in the Chrome store.


On Jan 2, 2018, at 9:53 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html>

$ curl -XDELETE 'http://localhost:9200/<index_name>/'

The index name is set in the UR engine.json, or in pio-env, depending on which index you want to delete.


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles its model in its own way, but most use the model storage configured in pio-env, so deleting that storage will get rid of the model. The UR keeps the model in ES under the “indexName” and “typeName” in engine.json, so you need to delete the index if you want to stop queries from working. The UR maintains one live copy of the model and removes old ones after a new one is made live, so there will only ever be one model (unless you have changed your indexName often)
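For reference, the indexName/typeName Pat mentions sit under the algorithm params in the UR's engine.json, roughly like this (the values are illustrative defaults, not required names):

```json
{
  "algorithms": [
    {
      "name": "ur",
      "params": {
        "appName": "MyApp",
        "indexName": "urindex",
        "typeName": "items"
      }
    }
  ]
}
```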


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json <http://localhost:8000/queries.json>

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia




-- 
Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior

no...@vicomtech.org
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos para Energía y Procesos Industriales

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>



Re: App still returns results after pio app data-delete

2018-01-02 Thread Pat Ferrel
Have a look at the ES docs on their site. There are several ways, from sending 
a JSON command to deleting the data directory depending on how clean you want 
ES to be.

In general my opinion is that PIO is an integration framework for several 
services and for the majority of applications you will not need to deal 
directly with the services except for setup. This may be an exception. In all 
cases you may find it best to seek guidance from the support communities or 
docs of those services.

If you are sending a REST JSON command directive it would be as shown here: 
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-delete-index.html>

$ curl -XDELETE 'http://localhost:9200/<index_name>/'

The index name is set in the UR engine.json, or in pio-env, depending on which index you want to delete.


On Jan 2, 2018, at 12:22 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks for the explanation!

How do I delete the ES index? is it just DELETE /my_index_name?

Happy New Year!

On 22 December 2017 at 19:42, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
With PIO the model is managed by the user, not PIO. The input is separate and 
can be deleted without affecting the model.

Each Engine handles its model in its own way, but most use the model storage configured in pio-env, so deleting that storage will get rid of the model. The UR keeps the model in ES under the “indexName” and “typeName” in engine.json, so you need to delete the index if you want to stop queries from working. The UR maintains one live copy of the model and removes old ones after a new one is made live, so there will only ever be one model (unless you have changed your indexName often)


On Dec 21, 2017, at 4:58 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi all!

I have executed a pio app data-delete MyApp. The command has outputted the 
following:

[INFO] [HBLEvents] Removing table pio_event:events_4...
[INFO] [App$] Removed Event Store for the app ID: 4
[INFO] [HBLEvents] The table pio_event:events_4 doesn't exist yet. Creating 
now...
[INFO] [App$] Initialized Event Store for the app ID: 4


However, I executed

curl -H "Content-Type: application/json" -d '
{
}' http://localhost:8000/queries.json <http://localhost:8000/queries.json>

after deleting the data and I still get the same results as before deleting the 
data. Why is this happening?

I expected to get either an error message or an empty result like 
{"itemScores":[]}.

Any help is much appreciated.
Best regards,
Noelia






Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
I did not write the template you are using. I am trying to explain what the 
template should be doing and how ALS works. I’m sure that with exactly the same 
data you should get the same results but in real life you will need to 
understand the algorithm a little deeper and so the pointer to the code that is 
being executed by the template from Spark MLlib.  If this is not helpful please 
ignore the advice.


On Dec 22, 2017, at 11:16 AM, GMAIL <babaevka...@gmail.com> wrote:

But I strictly followed the instructions from the site and did not even change anything. Everything I did was the steps from this page. I did not perform any additional operations, including editing the source code.

Instruction (Quick Start - Recommendation Engine Template): 
http://predictionio.incubator.apache.org/templates/recommendation/quickstart/ 
<http://predictionio.incubator.apache.org/templates/recommendation/quickstart/>

2017-12-22 22:12 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict rating and the other for implicit scoring 
used to predict something the user will prefer. 

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback
 
<https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback>


On Dec 21, 2017, at 11:09 PM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

I wanted to use the Recommender because I expected that it could predict the scores the way MovieLens does. And it seems to be doing so, but for some reason the input and output scales are different: the imported scores run from 1 to 5, and the predicted ones from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that in essence there is also a score. I looked at the DataSource in the Recommender and there were only two events: rate and buy. Rate takes a score, and buy implicitly sets the rating to 4 (out of 5, I think).

And I still do not understand exactly where I should look and what to correct so that the incoming and predicted scores are on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.
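RMSE as described is simple to compute; a minimal, self-contained sketch (no particular recommender assumed): score predicted ratings against a held-out set, keep the regularization value that minimizes the error, and only then round predictions to the rating scale.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between parallel lists of ratings."""
    assert len(predicted) == len(actual) and predicted
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    )

# Compare candidate regularization params on a hold-out set and keep the
# one with the lowest RMSE; clamp/round served predictions to the rating
# scale afterwards, as suggested above.
print(rmse([4.0, 3.0, 5.0], [4.0, 2.0, 5.0]))
```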


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

That is, the predicted scores that the Recommender returns are not just scaled by two, but may be completely wrong? I cannot, say, just divide the predictions by 2 and pretend that everything is fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. When using ratings as numerical values with a “Matrix Factorization” recommender like the ones in MLlib, upon which the Recommendations Template is based, you need to set a regularization parameter. I don’t know for sure, but maybe this is why the results don’t come out in the range of the input ratings. I haven’t looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something like the length of time a user watches a video, and review sites prefer “like” and “dislike”. The first is implicit and the second is quite unambiguous.

Re: How to import item properties dynamically?

2017-12-22 Thread Pat Ferrel
The properties go into the Event Store immediately but you have to train to get 
them into the model, this assuming your template support item properties. If yo 
uare using the UR, the properties will not get into the model until the next 
`pio train…`


On Dec 22, 2017, at 3:37 AM, Noelia Osés Fernández  wrote:


Hi all,

I have a pio app and I need to update item properties regularly. However, not 
all items will have all properties always. So I want to update the properties 
dynamically doing something similiar to the following:

# create properties json
propertiesjson = '{'
if "tiempo" in dfcolumns:
    propertiesjson = propertiesjson + '"tiempo": ' + str(int(plan.tiempo))
if "duracion" in dfcolumns:
    propertiesjson = propertiesjson + ', "duracion": ' + str(plan.duracion)
propertiesjson = propertiesjson + '}'

# add event
client.create_event(
    event="$set",
    entity_type="item",
    entity_id=plan.id_product,
    properties=json.dumps(propertiesjson)
)


However, this results in an error message:


Traceback (most recent call last):
  File "import_itemproperties.py", line 110, in <module>
import_events(client, args.dbuser, args.dbpasswd, args.dbhost, args.dbname)
  File "import_itemproperties.py", line 73, in import_events
properties=json.dumps(propertiesjson)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 255, in create_event
event_time).get_response()
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/connection.py", 
line 111, in get_response
self._response = self.rfunc(tmp_response)
  File 
"/home/ubuntu/.local/lib/python2.7/site-packages/predictionio/__init__.py", 
line 130, in _acreate_resp
response.body))
predictionio.NotCreatedError: request: POST 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U
 {'entityId': 8, 'entityType': 'item', 'properties': '"{\\"tiempo\\": 2, 
\\"duracion\\": 60}"', 'event': '$set', 'eventTime': 
'2017-12-22T11:29:59.762+'} 
/events.json?accessKey=0Hys1qwfgo3vF16jElBDJJnSLmrkN5Tg86qAPqepYPK_-lXMqI4NMjLXaBGgQJ4U?entityId=8=item=%22%7B%5C%22tiempo%5C%22%3A+2%2C+%5C%22duracion%5C%22%3A+60%2C=%24set=2017-12-22T11%3A29%3A59.762%2B
 status: 400 body: {"message":"org.json4s.package$MappingException: Expected 
object but got JString(\"{\\\"tiempo\\\": 2, \\\"duracion\\\": 60}\")"}


Any help is much appreciated!
Season's greetings!
Noelia
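The traceback above pinpoints the bug: "Expected object but got JString". The properties argument must be a JSON object (a Python dict), but the snippet builds a JSON string by hand and then json.dumps encodes that string a second time, so the server receives a quoted string. A corrected sketch (plan and dfcolumns are the names from the snippet above; the PredictionIO Python SDK serializes the dict itself, and this version also avoids the stray leading comma when only "duracion" is present):

```python
import json

def build_properties(plan, dfcolumns):
    """Build the $set properties as a plain dict; no manual string
    concatenation and no json.dumps before passing it to the SDK."""
    properties = {}
    if "tiempo" in dfcolumns:
        properties["tiempo"] = int(plan.tiempo)
    if "duracion" in dfcolumns:
        properties["duracion"] = plan.duracion
    return properties

# What went wrong: dumping an already-encoded JSON string yields a JSON
# *string*, not an object -- exactly the JString the server complained about.
double_encoded = json.dumps('{"tiempo": 2}')

# client.create_event(event="$set", entity_type="item",
#                     entity_id=plan.id_product,
#                     properties=build_properties(plan, dfcolumns))
```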






Re: Recommendation return score more than 5

2017-12-22 Thread Pat Ferrel
Implicit means you assign a score to the event based on your own guess. 
Explicit uses ratings the user makes. One score is a guess by you (like a 4 for 
buy) and the other is a rating made by the user. ALS comes in 2 flavors, one 
for explicit scoring, used to predict rating and the other for implicit scoring 
used to predict something the user will prefer. 

Make sure your template is using the explicit version of ALS. 
https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback
 
<https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html#explicit-vs-implicit-feedback>

On Dec 21, 2017, at 11:09 PM, GMAIL <babaevka...@gmail.com> wrote:

I wanted to use the Recommender because I expected that it could predict the scores the way MovieLens does. And it seems to be doing so, but for some reason the input and output scales are different: the imported scores run from 1 to 5, and the predicted ones from 1 to 10.

If by implicit scores you mean events without parameters, then I am aware that in essence there is also a score. I looked at the DataSource in the Recommender and there were only two events: rate and buy. Rate takes a score, and buy implicitly sets the rating to 4 (out of 5, I think).

And I still do not understand exactly where I should look and what to correct so that the incoming and predicted scores are on the same scale.

2017-12-19 4:10 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
There are 2 types of MLlib ALS recommenders last I checked, implicit and 
explicit. Implicit ones you give any arbitrary score, like a 1 for purchase. 
The explicit one you can input ratings and it is expected to predict ratings 
for an individual. But both iirc also have a regularization parameter that 
affects the scoring and is a param so you have to experiment with it using 
cross-validation to see where you get the best results.

There is an old metric used for this type of thing called RMSE 
(root-mean-square error) which, when minimized will give you scores that most 
closely match actual scores in the hold-out set (see wikipedia on 
cross-validation and RMSE). You may have to use explicit ALS and tweak the 
regularization param, to get the lowest RMSE. I doubt anything will guarantee 
them to be in exactly the range of ratings so you’ll then need to pick the 
closest rating.


On Dec 18, 2017, at 10:42 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

That is, the predicted scores that the Recommender returns are not just scaled by two, but may be completely wrong? I cannot, say, just divide the predictions by 2 and pretend that everything is fine?

2017-12-18 21:35 GMT+03:00 Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>>:
The UR and the Recommendations Template use very different technology 
underneath. 

In general the scores you get from recommenders are meaningless on their own. When using ratings as numerical values with a “Matrix Factorization” recommender like the ones in MLlib, upon which the Recommendations Template is based, you need to set a regularization parameter. I don’t know for sure, but maybe this is why the results don’t come out in the range of the input ratings. I haven’t looked at the code in a long while.

If you are asking about the UR it would not take numeric ratings and the scores 
cannot be compared to them.

For many reasons that I have written about before I always warn people about 
using ratings, which have been discontinued as a source of input for Netflix 
(who have removed them from their UX) and many other top recommender users. 
There are many reasons for this, not the least of which is that they are 
ambiguous and don’t directly relate to whether a user might like an item. For 
instance most video sources now use something like the length of time a user 
watches a video, and review sites prefer “like” and “dislike”. The first is 
implicit and the second is quite unambiguous. 


On Dec 18, 2017, at 12:32 AM, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:

Is it just me, or does the UR differ strongly from the Recommender?
At least I can't find the method getRatings in the DataSource class, which contained all the events (in particular, "rate") that I needed.

2017-12-18 11:14 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>>:
I didn't solve the problem :(

Now I use the universal recommender

On 18 December 2017 at 09:12, GMAIL <babaevka...@gmail.com 
<mailto:babaevka...@gmail.com>> wrote:
And how did you solve this problem? Did you divide prediction score by 2?

2017-12-18 10:40 GMT+03:00 Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>>:
I got the same problem. I still don't know the answer to your question :(

On 17 December 2017 at 14:07, GMAIL <babaevka...@gmail.

Re: Recommended Configuration

2017-12-15 Thread Pat Ferrel
That is enough for a development machine and may work if your data is relatively small, but for big data, clusters of CPUs with a fair amount of RAM and storage are required. The telling factor is partly how big your data is, but also how it combines to form models, which will depend on which recommender you are using. 

We usually build big clusters to analyze the data, then downsize them when we see how much is needed. If you have small data (< 1M events), you may try a single machine. 


On Dec 15, 2017, at 3:59 AM, GMAIL  wrote:

Hi. 
Could you tell me the recommended configuration for comfortable work with the PredictionIO Recommender Template? 
I read that I need 16 GB of RAM, but what about the rest (CPU/storage/GPU(?))? 

P.S. sorry for my English.



Re: User features to tailor recs in UR queries?

2017-12-05 Thread Pat Ferrel
The user’s possible indicators of taste are encoded in the usage data. Gender 
and other “profile” type data can be encoded as (user-id, gender, gender-id), but 
this is used as a secondary indicator, not as a filter. Only item properties 
are used as filters, for some very practical reasons. For one thing, items are 
what you are recommending, so you would have to establish some relationship 
between items and gender of buyers. The UR does this with user data in 
secondary indicators but does not filter by these because they are calculated 
properties, not ones assigned by humans, like “in-stock” or “language”.

Location is an easy secondary indicator but needs to be encoded with “areas”, 
not lat/lon, so something like (user-id, location-of-purchase, 
country-code+postal-code). This would be triggered when a primary event happens, 
such as a purchase. This way location is accounted for in making 
recommendations without your having to do anything but feed in the data.

Lat/lon proximity filters are not implemented but possible.
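
Such a secondary indicator is just another event sent to the event server. A minimal sketch of building one in Python (the event name `location-pref`, the `location` target entity type, and the area encoding are illustrative assumptions, not names fixed by the UR):

```python
import json

def location_event(user_id, area_code, event_time):
    """Build a PredictionIO-style event recording purchase location
    as a secondary indicator: (user-id, location-of-purchase, area)."""
    return {
        "event": "location-pref",     # hypothetical indicator name
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "location",
        "targetEntityId": area_code,  # an "area", e.g. country + postal code
        "eventTime": event_time,
    }

# Fired whenever a primary event (e.g. a purchase) happens:
evt = location_event("u-123", "US-98101", "2017-12-05T07:59:00.000Z")
payload = json.dumps(evt)  # body for a POST to the event server
```

The point is that the indicator is fed in as ordinary usage data; the correlation test decides whether location actually matters.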

One thing to note is that fields used to filter or boost are very different 
from user taste indicators. For one thing, they are never tested for correlation 
with the primary event (purchase, read, watch, …), so they can be very dangerous 
to use unwisely. They are best used for business rules, like only show 
“in-stock”, or in this video carousel show only videos of the “mystery” genre. 
But if you use user profile data to filter recommendations, you can distort what 
is returned and get bad results. We once had a client that wanted to do this 
against our warnings, filtering by location, gender, and several other things 
known about the user, and got 0 lift in sales. We convinced them to try without 
the “business rules” and got good lift in sales. User taste indicators are best 
left to the correlation test by inputting them as user indicator data, except 
where you purposely want to reduce the recommendations to a subset for a 
business reason.

Put more simply, business rules can kill the value of a recommender; let it 
figure out whether an indicator matters. And always remember that indicators 
apply to users; filters and boosts apply to items and known properties of 
items. It may seem like genre is both a user taste indicator and an item 
property, but if you input it in 2 ways it can be used in 2 ways: 1) to make 
better recommendations, 2) in business rules. They are stored and used in 
completely different ways.
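
To make the split concrete, here is a sketch of assembling a UR-style query where business rules ride along as `fields` entries while the user's taste indicators stay in the model. The field names are hypothetical, and the bias convention (negative to filter, positive to boost) should be checked against the UR docs for your version:

```python
def ur_query(user_id, rules=None):
    """Assemble a Universal Recommender-style query dict.
    Each rule is (field_name, values, bias); by the assumed
    convention, bias < 0 filters and bias > 1 boosts."""
    q = {"user": user_id}
    if rules:
        q["fields"] = [
            {"name": name, "values": values, "bias": bias}
            for name, values, bias in rules
        ]
    return q

# Business rules: only in-stock items, only the "mystery" genre.
q = ur_query("u-123", [("in-stock", ["true"], -1),
                       ("genre", ["mystery"], -1)])

# No rules: recommendations driven purely by correlated indicators.
q_plain = ur_query("u-123")
```

Note how nothing about the user's profile appears in the rules; gender, location, etc. went in as indicator events instead.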



On Dec 5, 2017, at 7:59 AM, Noelia Osés Fernández  wrote:

Hi all,

I have seen how to use item properties in queries to tailor the recommendations 
returned by the UR.

But I was wondering whether it is possible to use user characteristics to do 
the same. For example, I want to query for recs from the UR but only taking 
into account the history of users that are female (or only using the history of 
users in the same county). Is this possible to do?

I've been reading the UR docs but couldn't find info about this.

Thank you very much!

Best regards,
Noelia

-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com 
.
To post to this group, send email to actionml-u...@googlegroups.com 
.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/CAMysefu-8mOgh3NsRkRVN6H6bRm6hR%2B1HuryT4wqgtXZD3norg%40mail.gmail.com
 
.
For more options, visit https://groups.google.com/d/optout 
.



Re: Log-likelihood based correlation test?

2017-11-23 Thread Pat Ferrel
Use the default. Tuning with a threshold is only for atypical data and unless 
you have a harness for cross-validation you would not know if you were making 
things worse or better. We have our own tools for this but have never had the 
need for threshold tuning. 

Yes, llrDownsampled(PtP) is the “model”, each doc put into Elasticsearch is a 
sparse representation of a row from it, along with those from PtV, PtC,… Each 
gets a “field” in the doc.


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in 
PtP?

On 21 November 2017 at 19:56, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
No. PtP non-zero elements have LLR calculated. The highest scores in the row are 
kept, or ones above some threshold; the rest are removed as “noise”. These are 
put into the Elasticsearch model without scores. 

Elasticsearch compares the similarity of the user history to each item in the 
model to find the KNN similar ones. This uses OKAPI BM25 from Lucene, which has 
several benefits over pure cosines (it actually consists of adjustments to 
cosine) and we also use norms. With ES 5 we should see quality improvements due 
to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html
 
<https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html>



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
EleasticSearc and when we query for user recommendations ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater than or equal to 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the results of this cosine similarity metric what is 
returned as a 'score'?

If it is, when it is greater than 1, is this because the different cosine 
similarities are added together i.e. PtP, PtL... ?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep or not (both are supported in the UR). LLR 
is a metric for seeing how likely 2 events in a large group are to be correlated. 
Therefore LLR is only used to remove weak data from the model.
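
For reference, the LLR test can be sketched from the 2x2 event counts (k11 = times both events occurred together, k12/k21 = one without the other, k22 = neither), following Dunning's formulation as used in Mahout. Treat this as an illustrative sketch, not the production code path:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy over raw counts.
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 0.0 if row + col < mat else 2.0 * (row + col - mat)

weak = llr(10, 10, 10, 10)      # independent counts: score near 0
strong = llr(100, 1, 1, 10000)  # strong cooccurrence: large score
```

A high score means the cooccurrence is unlikely to be chance, so the entry survives into the model; low scores are the "weak data" that gets pruned.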

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items that 
most closely match it. Since PtP will have items in rows and the row will have 
correlating items, this “search” method works quite well to find items whose 
co-purchased items closely match the user’s history.

=== that is the simple explanation 


Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it is all purchased items as the row 
key, and other items purchased along with the row key. LLR filters out the 
weakly correlating non-zero values (0 means no evidence of correlation anyway). 
If we didn’t do this it would be purely a “cooccurrence” recommender, one of 
the first useful ones. But filtering based on cooccurrence strength (PtP values 
without LLR applied to them) produces much worse results than using LLR to 
filter for the most highly correlated cooccurrences. You get a similar effect with 
Matrix Factorization but you can only use one type of event, for various reasons.
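
The PtP step itself is plain matrix multiplication of the transposed event matrix with itself; a toy pure-Python sketch (in Mahout this happens on distributed sparse matrices, and LLR then prunes the weak entries):

```python
def cooccurrence(P):
    """Compute PtP for a binary users-x-items purchase matrix,
    given as a list of user rows."""
    items = list(zip(*P))  # transpose: one tuple per item column
    return [[sum(a * b for a, b in zip(x, y)) for y in items]
            for x in items]

# Toy purchase matrix: 3 users x 3 items, 1 = purchased
P = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]

PtP = cooccurrence(P)
# PtP[i][j] counts users who bought both item i and item j;
# the diagonal counts how often each item was bought at all.
```

Each row of the pruned result becomes one Elasticsearch doc: row key = the item, field values = the items that survived the LLR test.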

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC 
(purchase, category-preferences). We did an experiment using Mean Average 
Precision for the UR using video “Likes” vs “Likes” and “Dislikes

Re: Log-likelihood based correlation test?

2017-11-20 Thread Pat Ferrel
Yes, this will show the model. But if you do this a lot there are tools like 
Restlet that you plug in to Chrome. They will allow you to build queries of all 
sorts. For instance 
GET http://localhost:9200/urindex/_search?pretty 

will show the item rows of the UR model put into the index for the integration 
test data. The UI is a bit obtuse but you can scroll down in the right pane 
expanding bits of JSON as you go to see this:

"hits": {
  "total": 7,
  "max_score": 1,
  "hits": [
    {
      "_index": "urindex_1511033890025",
      "_type": "items",
      "_id": "Nexus",
      "_score": 1,
      "_source": {
        "defaultRank": 4,
        "expires": "2017-11-04T19:01:23.655-07:00",
        "countries": ["United States", "Canada"],
        "id": "Nexus",
        "date": "2017-11-02T19:01:23.655-07:00",
        "category-pref": ["tablets"],
        "categories": ["Tablets", "Electronics", "Google"],
        "available": "2017-10-31T19:01:23.655-07:00",
        "purchase": [],
        "popRank": 2,
        "view": ["Tablets"]
      }
    },

As you can see no purchased items survived the correlation test, one survived 
the view and category-pref correlation tests. The other fields are item 
properties set using $set events and are used with business rules.

With something like this tool you can even take the query logged in the 
deployed PIO server and send it, to see how the query is constructed and what 
the results are (same as you get from the SDK, I’ll wager :-)



On Nov 20, 2017, at 7:07 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:

There is a REST client for Elasticsearch and bindings in many popular languages 
but to get started quickly I found this commands helpful:

List Indices:

curl -XGET 'localhost:9200/_cat/indices?v'

Get some documents from an index:

curl -XGET 'localhost:9200//_search?q=*'

Then look at the "_source" in the document to see what values are associated 
with the document.

More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source
 
<https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source>

this might also be helpful to work through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
 
<https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html>





On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:
Thanks Daniel!

And excuse my ignorance but... how do you inspect the ES index?

On 20 November 2017 at 15:29, Daniel Gabrieli <dgabri...@salesforce.com 
<mailto:dgabri...@salesforce.com>> wrote:
There is this cli tool and article with more information that does produce 
scores:

https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html 
<https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html>

But I don't know of any commands that return diagnostics about LLR from the PIO 
framework / UR engine.  That would be a nice feature if it doesn't exist.  The 
way I've gotten some insight into what the model is doing when using PIO / UR 
is by inspecting the Elasticsearch index that gets created, because it 
has the "significant" values populated in the documents (though not the actual 
LLR scores).

On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:
This thread is very enlightening, thank you very much!

Is there a way I can see what the P, PtP, and PtL matrices of an app are? In 
the handmade case, for example?

Are there any pio calls I can use to get these?

On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep or not (both are supported in the UR). LLR 
is a metric for seeing how likely 2 events in a large group are to be correlated. 
Therefore LLR is only used to remove weak data from the model.

So Mahout builds the model then it is put into Elasticsearch which is used as a 
KNN (K-nearest Neighbors) engine. The LLR score is not put into the model only 
an indicator that the item survived the LLR test.

The KNN is applied using the user’s history as the query, finding the items that 
most closely match it. Since PtP will have items in rows and the row will have 
correlating items, this “search” method works quite well to find items whose 
co-purchased items closely match the user’s history.

=== th

Re: Error in getting Total Events in a predictionIo App

2017-11-14 Thread Pat Ferrel
You should use pio 0.12.0 if you need Elasticsearch 5.x


On Nov 14, 2017, at 6:39 AM, Abhimanyu Nagrath  
wrote:

Hi , I am new to predictionIo using version V0.11-incubating (spark - 2.6.1 , 
hbase - 1.2.6 , elasticsearch - 5.2.1) . Started the prediction server with 
./pio-start-all and checked Pio status these are working fine. Then I created 
an app 'testApp' and imported some events into that predictionIO app, Now 
inorder to verify the count of imported events .I ran the following commands 

 1. pio-shell --with-spark
 2. import org.apache.predictionio.data.store.PEventStore
 3. val eventsRDD = PEventStore.find(appName="testApp")(sc)

I got the error:

ERROR Storage$: Error initializing storage client for source ELASTICSEARCH
java.lang.ClassNotFoundException: elasticsearch.StorageClient
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at 
org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:228)
at 
org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:254)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
at 
scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at 
org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:215)
at 
org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:284)
at 
org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:269)
at 
org.apache.predictionio.data.storage.Storage$.getMetaDataApps(Storage.scala:387)
at 
org.apache.predictionio.data.store.Common$.appsDb$lzycompute(Common.scala:27)
at org.apache.predictionio.data.store.Common$.appsDb(Common.scala:27)
at 
org.apache.predictionio.data.store.Common$.appNameToId(Common.scala:32)
at 
org.apache.predictionio.data.store.PEventStore$.find(PEventStore.scala:71)
at 
$line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
at $line19.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:37)
at $line19.$read$$iwC$$iwC$$iwC$$iwC.(:39)
at $line19.$read$$iwC$$iwC$$iwC.(:41)
at $line19.$read$$iwC$$iwC.(:43)
at $line19.$read$$iwC.(:45)
at $line19.$read.(:47)
at $line19.$read$.(:51)
at $line19.$read$.()
at $line19.$eval$.(:7)
at $line19.$eval$.()
at $line19.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org 
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org 

Re: Which template for predicting ratings?

2017-11-13 Thread Pat Ferrel
What I was saying is the UR can use ratings, but not predict them. Use MLlib 
ALS recommenders if you want to predict them for all items.


On Nov 13, 2017, at 9:32 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

What we did in the article I attached is assume 1-2 is dislike, and 4-5 is like.

These are treated as indicators and will produce a score from the recommender 
but these do not relate to 1-5 scores.
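
The mapping above can be sketched as a tiny helper (thresholds follow the stated assumption, 1-2 dislike and 4-5 like; a 3 is treated as too ambiguous to use):

```python
def rating_to_indicator(rating):
    """Convert a 1-5 star rating into a categorical indicator,
    or None for an ambiguous middle rating."""
    if rating <= 2:
        return "dislike"
    if rating >= 4:
        return "like"
    return None  # a 3 carries no clear signal; drop it

# Turn rated items into indicator events, dropping the ambiguous ones:
raw = [("u1", 5), ("u2", 1), ("u3", 3)]
events = [(u, rating_to_indicator(r)) for u, r in raw
          if rating_to_indicator(r) is not None]
```

The resulting "like"/"dislike" events are unambiguous and can be fed to the UR as primary and secondary indicators, but they carry no star-scale score to predict.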

If you need to predict what the user would score an item MLlib ALS templates 
will do it.



On Nov 13, 2017, at 2:42 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:

Hi Pat,

I truly appreciate your advice.

However, what to do with a client that is adamant that they want to display the 
predicted ratings in the form of 1 to 5-stars? That's my case right now. 

I will pose a more concrete question. Is there any template for which the 
scores predicted by the algorithm are in the same range as the ratings in the 
training set?

Thank you very much for your help!
Noelia

On 10 November 2017 at 17:57, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Any of the Spark MLlib ALS recommenders in the PIO template gallery support 
ratings.

However I must warn that ratings are not very good for recommendations and none 
of the big players use ratings anymore, Netflix doesn’t even display them. The 
reason is that your 2 may be my 3 or 4 and that people rate different 
categories differently. For instance Netflix found Comedies were rated lower 
than Independent films. There have been many solutions proposed and tried but 
none have proven very helpful.

There is another more fundamental problem: why would you want to recommend the 
highest rated item? What do you buy on Amazon or watch on Netflix? Are they 
only your highest rated items? Research has shown that they are not. There was 
a whole misguided movement around ratings that affected academic papers and 
cross-validation metrics that has fairly well been discredited. It all came 
from the Netflix prize, which used both. Netflix has since led the way in 
dropping ratings as they saw the things I have mentioned.

What do you do? Categorical indicators work best (like, dislike), or implicit 
indicators (buy) that are unambiguous. If a person buys something, they like 
it; if they rate it 3, do they like it? I buy many 3-rated items on Amazon if I 
need them. 

My advice is drop ratings and use thumbs up or down. These are unambiguous and 
the thumbs down can be used in some cases to predict thumbs up: 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/ 
<https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/>
 This uses data from a public web site to show significant lift by using “like” 
and “dislike” in recommendations. This used the Universal Recommender.


On Nov 10, 2017, at 5:02 AM, Noelia Osés Fernández <no...@vicomtech.org 
<mailto:no...@vicomtech.org>> wrote:


Hi all,

I'm new to PredictionIO so I apologise if this question is silly.

I have an application in which users are rating different items in a scale of 1 
to 5 stars. I want to recommend items to a new user and give her the predicted 
rating in number of stars. Which template should I use to do this? Note that I 
need the predicted rating to be in the same range of 1 to 5 stars.

Is it possible to do this with the ecommerce recommendation engine?

Thank you very much for your help!
Noelia









-- 
 <http://www.vicomtech.org/>

Noelia Osés Fernández, PhD
Senior Researcher |
Investigadora Senior

no...@vicomtech.org <mailto:no...@vicomtech.org>
+[34] 943 30 92 30
Data Intelligence for Energy and
Industrial Processes | Inteligencia
de Datos para Energía y Procesos
Industriales

 <https://www.linkedin.com/company/vicomtech>  
<https://www.youtube.com/user/VICOMTech>  <https://twitter.com/@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/> <http://www.ik4.es/>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>



Re: Does PIO support [ --master yarn --deploy-mode cluster ]?

2017-11-13 Thread Pat Ferrel
yarn-cluster mode is supported but extra config needs to be set so the driver 
can be run on a remote machine.

I have seen instructions for this on the PIO mailing list.



On Nov 12, 2017, at 7:30 PM, wei li  wrote:

Hi Pat
Thanks a lot for your advice.

We are using [yarn-client] mode now; the UR trains well and we can monitor the 
output log at the pio application console.

I tried to find a way to use [yarn-cluster] mode, to submit a train job and 
shut down the pio application (in docker) immediately 
(monitoring the job process at the hadoop cluster website instead of the pio application 
console).
But then I met errors like this: file path [file://xxx.jar] cannot be found.

Maybe [yarn-cluster] mode is not supported now. I will keep looking for an 
explanation.


在 2017年11月11日星期六 UTC+8上午12:41:33,pat写道:
Yes, PIO supports Yarn, but you may have more luck getting an explanation on the 
PredictionIO mailing list.
Subscribe here: http://predictionio.incubator.apache.org/support/ 


On Nov 9, 2017, at 11:33 PM, wei li  wrote:

Hi, all

Any one have any idea about this?

-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-use...@googlegroups.com .
To post to this group, send email to action...@googlegroups.com .
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/af5c6748-ae7f-4c05-bbc5-6dcf6c1a480a%40googlegroups.com
 
.
For more options, visit https://groups.google.com/d/optout 
.


-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com 
.
To post to this group, send email to actionml-u...@googlegroups.com 
.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/8668b1a1-09b9-4de8-aedb-5b786a9cf7e4%40googlegroups.com
 
.
For more options, visit https://groups.google.com/d/optout 
.



Re: PIO + ES5 + Universal Recommender

2017-11-08 Thread Pat Ferrel
“mvn not found”, install mvn. 

This step will go away with the next Mahout release.


On Nov 8, 2017, at 2:41 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thanks Pat!

I have followed the instructions on the README.md file of the mahout folder:


You will need to build this using Scala 2.11. Follow these instructions

 - install Scala 2.11 as your default version

I've done this with the following commands:

# scala install
wget www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
# sbt installation
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt

 - download this repo: `git clone https://github.com/actionml/mahout.git`
 - checkout the speedup branch: `git checkout sparse-speedup-13.0`
 - edit the build script `build-scala-2.11.sh` to 
put the custom repo where you want it

This file is now:

#!/usr/bin/env bash

git checkout sparse-speedup-13.0

mvn clean package -DskipTests -Phadoop2 -Dspark.version=2.1.1 
-Dscala.version=2.11.11 -Dscala.compat.version=2.11

echo "Make sure to put the custom repo in the right place for your machine!"
echo "This location will have to be put into the Universal Recommenders 
build.sbt"

mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/hdfs/target/mahout-hdfs-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-hdfs -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math/target/mahout-math-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math -Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/math-scala/target/mahout-math-scala_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-math-scala_2.11 
-Dversion=0.13.0
mvn deploy:deploy-file 
-Durl=file:///home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/.custom-scala-m2/repo/
 
-Dfile=//home/ubuntu/PredictionIO/apache-predictionio-0.12.0-incubating/PredictionIO-0.12.0-incubating/vendors/mahout/spark/target/mahout-spark_2.11-0.13.0.jar
 -DgroupId=org.apache.mahout -DartifactId=mahout-spark_2.11 -Dversion=0.13.0

 - execute the build script `build-scala-2.11.sh`

This outputed the following:

$ ./build-scala-2.11.sh
Already on 'sparse-speedup-13.0'
Your branch is up-to-date with 'origin/sparse-speedup-13.0'.
./build-scala-2.11.sh: line 5: mvn: command not found
Make sure to put the custom repo in the right place for your machine!
This location will have to be put into the Universal Recommenders build.sbt
./build-scala-2.11.sh: line 10: mvn: command not found
./build-scala-2.11.sh: line 11: mvn: command not found
./build-scala-2.11.sh: line 12: mvn: command not found
./build-scala-2.11.sh: line 13: mvn: command not found


Do I need to install Maven? If so, it is not mentioned in the PredictionIO 
installation instructions nor in the Mahout instructions. 

I apologise if this is an obvious question for those familiar with the Apache 
projects, but for an outsider like me it helps when everything (even the most 
silly details) is spelled out. Thanks a lot for all your invaluable help!!
 

On 7 November 2017 at 20:58, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

https://github.com/actionml/mahout <https://github.com/actionml/mahout>






Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.




On Nov 7, 2017, at 7:26 AM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

https://github.com/actionml/mahout <https://github.com/actionml/mahout>




Re: PIO + ES5 + Universal Recommender

2017-11-07 Thread Pat Ferrel
Very sorry, it was incorrectly set to private. Try it again.



On Nov 7, 2017, at 12:52 AM, Noelia Osés Fernández <no...@vicomtech.org> wrote:

Thank you, Pat!

I have a problem with the Mahout repo, though. I get the following error 
message:

remote: Repository not found.
fatal: repository 'https://github.com/actionml/mahout.git/ 
<https://github.com/actionml/mahout.git/>' not found


On 3 November 2017 at 22:27, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:
The exclusion rules are working now along with the integration-test. We have 
some cleanup but please feel free to try it.

Please note the upgrade issues mentioned below before you start, fresh installs 
should have no such issues.


On Nov 1, 2017, at 4:30 PM, Pat Ferrel <p...@occamsmachete.com 
<mailto:p...@occamsmachete.com>> wrote:

Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. The 
test will be fixed before release but do trust it to populate PIO with some 
sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, Hadoop 2.6 or greater. If you have issues 
getting pio to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data, or run the integration test to get some sample data 
installed in the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually, 
even worse, it is still in HBase but you can’t get at it, so to upgrade do the 
following:
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import…` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git and follow the instructions in the 
README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git and checkout branch 
0.7.0-SNAPSHOT
replace the line `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local 
Mahout build
build the UR with `pio build` or run the integration test to get sample data 
put into PIO `./examples/integration-test`

This will use the released PIO and alpha UR

This will be much easier when it’s released
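Collected into one place, the upgrade steps above might look something like the sketch below. The app name, ids, and backup paths are hypothetical, and the exact `pio` sub-command flags may differ by version, so treat this as an outline to review, not a tested script:

```shell
# Sketch of the ES1 -> ES5 upgrade sequence described above.
# Review before running: `pio data-delete` is destructive.
cat > pio-upgrade.sh <<'EOF'
#!/bin/sh
set -e
pio export --appid 1 --output /backup/handmade   # BEFORE upgrading!
pio app data-delete handmade                     # BEFORE upgrading!
# ... build and install pio 0.12.0 and all services (point of no return) ...
pio app new handmade
pio import --appid 1 --input /backup/handmade
EOF
chmod +x pio-upgrade.sh
```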

-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to actionml-user+unsubscr...@googlegroups.com 
<mailto:actionml-user+unsubscr...@googlegroups.com>.
To post to this group, send email to actionml-u...@googlegroups.com 
<mailto:actionml-u...@googlegroups.com>.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/326BE669-574B-45A5-AAA5-6A285BA0B33E%40occamsmachete.com
 
<https://groups.google.com/d/msgid/actionml-user/326BE669-574B-45A5-AAA5-6A285BA0B33E%40occamsmachete.com?utm_medium=email_source=footer>.
For more options, visit https://groups.google.com/d/optout 
<https://groups.google.com/d/optout>.




-- 
Noelia Osés Fernández, PhD
Senior Researcher | Investigadora Senior
Vicomtech (http://www.vicomtech.org/)
no...@vicomtech.org
+[34] 943 30 92 30
Data Intelligence for Energy and Industrial Processes | Inteligencia de Datos 
para Energía y Procesos Industriales

Re: Implementing cart and wishlist item events into Ecommerce recommendation template

2017-11-04 Thread Pat Ferrel
Oh, forgot to say the most important part. The ECom recommender does not 
support shopping carts unless you train on (cart-id, item-id-of-item-added-to-cart). 
And even then I’m not sure you can query with the current cart’s contents, 
since the item-based query is for a single item. The cart-id takes the place of 
user-id in this method of training. There may be a way to do this in the MLlib 
implementation, but it isn’t surfaced in the PIO interface. It would be 
expressed as an anonymous user (one not in the training data) and would take an 
item list in the query. Look into the MLlib ALS library and expect to modify 
the template code.

There is also the Complementary Purchase template, which does shopping carts, 
but, from my rather prejudiced viewpoint, if you need to switch templates, use 
one that supports every use-case you are likely to need.


On Nov 4, 2017, at 9:34 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

The Universal Recommender supports several types of “item-set” recommendations:
1) Complementary purchases, which are things bought with what you have in the 
shopping cart. This is done by training on (cart-id, “add-to-cart”, item-id) 
and querying with the current items in the user’s cart. 
2) Similar items to those in the cart. This is done by training with the 
typical events like purchase, detail-view, add-to-cart, etc. for each user; 
the query is then the contents of the shopping cart as an “item-set”. This 
gives things similar to what is in the cart, which is usually not the precise 
semantics for a shopping cart but fits other uses of an item-set, like 
wish-lists.
3) Take the last n items viewed and query with them, and you have 
“recommendations based on your recent views”. In this case you need purchases 
as the primary event, because you want to recommend purchases, but using only 
“detail-views” to do so. 
4) Some other combinations like favorites, watch-lists, etc.

These work slightly differently, and I could give examples of how they are 
used in Amazon, but #1 is typically used for the “shopping cart".
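As a concrete illustration of the (cart-id, “add-to-cart”, item-id) training data in #1, here is roughly what one such event looks like when sent to the PIO EventServer. The ids, host, and access key are hypothetical:

```shell
# One "add-to-cart" event; the cart id goes in the entityId field, so the
# cart (not the user) is the entity being trained on, per #1 above.
cat > add-to-cart-event.json <<'EOF'
{
  "event": "add-to-cart",
  "entityType": "user",
  "entityId": "cart-123",
  "targetEntityType": "item",
  "targetEntityId": "item-456",
  "eventTime": "2017-11-04T09:30:00.000Z"
}
EOF
# Send it to a running EventServer (access key comes from `pio app list`):
# curl -H "Content-Type: application/json" -d @add-to-cart-event.json \
#   "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY"
```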


On Nov 3, 2017, at 7:13 PM, ilker burak <ilkerbu...@gmail.com> wrote:

Hi Vaghawan,
I will check that. Thanks for your help and quick answer about this.

On Fri, Nov 3, 2017 at 8:02 AM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
Hey there, 

did you consider seeing this: 
https://predictionio.incubator.apache.org/templates/ecommercerecommendation/train-with-rate-event/

For such events you may want to use the $set events as shown in the template 
documentation. I use the Universal Recommender, though, since it already 
supports these requirements. 


Hope this helps. 

On Fri, Nov 3, 2017 at 10:37 AM, ilker burak <ilkerbu...@gmail.com> wrote:
Hello,
I am using the Ecommerce Recommendation template. Currently I have imported 
view and buy events and it works. To improve result accuracy, how can I modify 
the code to import and use events like 'user added item to cart' and 'user 
added item to wishlist'? I know this template supports adding new events, but 
the site only has an example of how to implement the rate event, and I am not 
using rate data.
Thank you

Ilker








Re: PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
Ack, I hate this &^%&%^&  touchbar!

What I meant to say was:


We have a version of the universal recommender working with PIO-0.12.0 that is 
ready for brave souls to test. This includes some speedups and quality of 
recommendation improvements, not yet documented. 

Known bugs: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test, Lucene and ES have changed their 
scoring method and so you cannot compare the old scores to the new ones. The 
test will be fixed before release but do trust it to populate PIO with some 
sample data you can play with.

You must build PredictionIO with the default parameters, so just run 
`./make-distribution`. This will require you to install Scala 2.11, Spark 2.1.1 
or greater, ES 5.5.2 or greater, and Hadoop 2.6 or greater. If you have issues 
getting pio to build and run, send questions to the PIO mailing list. Once PIO 
is running, test with `pio status` and `pio app list`. You will need to create 
an app and import your data; running the integration test installs some sample 
data into the “handmade” app.

*Backup your data*: moving from ES 1 to ES 5 will delete all data. Actually 
it’s even worse: the data is still in HBase, but you can’t get at it. To 
upgrade, do the following:
`pio export` with pio < 0.12.0 =*Before upgrade!*=
`pio data-delete` all your old apps =*Before upgrade!*=
build and install pio 0.12.0 including all the services =*The point of no 
return!*=
`pio app new …` and `pio import…` any needed datasets
download and build Mahout for Scala 2.11 from this repo: 
https://github.com/actionml/mahout.git  
follow the instructions in the README.md
download the UR from here: 
https://github.com/actionml/universal-recommender.git and checkout branch 
0.7.0-SNAPSHOT
replace the line `resolvers += "Local Repository" at 
"file:///Users/pat/.custom-scala-m2/repo"` with your path to the local 
Mahout build
build the UR with `pio build` or run the integration test to get sample data 
put into PIO `./examples/integration-test`

This will use the released PIO and alpha UR

This will be much easier when it’s released

PIO + ES5 + Universal Recommender

2017-11-01 Thread Pat Ferrel
We have a version working here: 
https://github.com/actionml/universal-recommender.git 

checkout 0.7.0-SNAPSHOT once you pull the repo. 

Known bug: exclusion rules not working. This will be fixed before release in 
the next few days

Issues: do not trust the integration test; Lucene and ES have changed their 
scoring method, so you cannot compare the old scores to the new ones. The 
test will be fixed before release.

You must build the Template with pio v0.12.0 using Scala 2.11, Spark 2.2.1, 
and ES 5.

Templates First

2017-10-20 Thread Pat Ferrel
PredictionIO is completely useless without a Template, yet as a group we seem 
too focused on releasing PIO without regard for Templates. This, IMO, must 
change. 90% of users will never touch the code of a template and only 1% will 
actually create a template. These guesses come from list questions. If this is 
true, we need to switch our mindset to "templates first”, not “pio first”. Before 
any upgrade vote, every committer should make sure their favorite template 
works with the new build. I will be doing so from now on.

We have one fairly significant problem that I see from a template supporter's 
side. PIO has several new build directives that change dependencies, like the 
Spark version, and tools, like the Scala version. These are unknown to 
templates, and there is no PIO-supported way to communicate them to the 
template's build.sbt. This leaves us with templates that will not work with 
most combinations of PIO builds. If we are lucky, they may be updated to work 
with the *default* pio config. But this did not happen when PIO-0.12.0 was 
released, only shortly afterwards. This must change; the Apache templates at 
least must have some support for PIO before release. Here is one idea that 
might help...

How do we solve this?

1) .make-distribution modifies or creates a script that can be imported by the 
template's build.sbt. This might be pio-env if we use `pio build` to build 
templates, because it is available to the template’s build.sbt, or something 
else when we move to using sbt to build templates directly. This script defines 
the values used to build PIO.
2) Update some or all of the Apache templates to use this mechanism to build 
with the right Scala version, etc., taken from the PIO build.

I had a user do this for the UR to support many different pio build directives, 
and some that are new. The result was a build.sbt that includes such things as 

val pioVersion = sys.env.getOrElse("PIO_VERSION", "0.12.0-incubating")
val scalaVersion = sys.env.getOrElse("PIO_SCALA_VERSION", "2.10.0")
val elasticsearch1Version = sys.env.getOrElse("PIO_ELASTIC_VERSION", "1.7.5")
val sparkVersion = sys.env.getOrElse("PIO_SPARK_VERSION", "1.4.0")

These are then used in the library dependency lists to pull in the right 
versions of the artifacts.
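The mechanism proposed in #1 could be sketched as follows. This is only an illustration of the idea, not an existing PIO feature; the script name, file name, and version values are all hypothetical:

```shell
# Hypothetical: have .make-distribution record the versions it built
# against in a small sourceable file that sets the env vars a template's
# build.sbt reads via sys.env (as in the getOrElse lines above).
cat > pio-build-env.sh <<'EOF'
export PIO_VERSION=0.12.0-incubating
export PIO_SCALA_VERSION=2.11.8
export PIO_SPARK_VERSION=2.1.1
export PIO_ELASTIC_VERSION=5.5.2
EOF

# A template build would then pick the versions up from the environment:
#   . ./pio-build-env.sh && pio build
```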

This, in some form, would allow templates to move in lock-step with changes in 
the way pio is built on any given machine. Without something like this, users 
even less expert at sbt than myself (hard to imagine) will have a significant 
problem dumped on them.

Since this is only partially baked, it may not be ready for a Jira and so 
warrants discussion.