I gave you the wrong schema entries in the advice about queries. Check the Solr documentation, which always trumps my guesses.
To use token phrases, do the following:

<fieldType name="indicator" class="solr.TextField" omitNorms="false">
  <!-- This simple tokenizer will split the text by spaces (and other punctuation) to separate item-id tokens -->
  <tokenizer class="solr.StandardTokenizerFactory"/>
</fieldType>

<field name="purchase" stored="true" type="indicator" multiValued="false" indexed="true"/>

> On Apr 16, 2015, at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> OK, this is cool. Almost there!
>
> In order to answer the question you have to decide how you will persist the indicators. The output of spark-itemsimilarity can be indexed directly but requires phrase searches on text fields. It requires the item-id strings to be single tokens so they aren't broken up by the analyzer used for Solr phrase queries. The better way to do it is to store the indicators as multi-valued fields in Solr or a DB. Many ways to slice this.
>
> If you want to index the output of Mahout directly we will assume the item IDs are tokenized and so contain no spaces, periods, commas, or other punctuation that would break a phrase, so we can encode the user history as a single string of space-separated item tokens.
>
> To do a query with something like "ipad iphone" we'll need to set up Solr like this, which is a bit of a guess since I use a DB, not the raw output files:
>
> <fieldType name="indicator" class="solr.TextField" omitNorms="false"/> <!-- NOTICE NO TOKENIZER OR ANALYZER USED -->
> <field name="purchase" stored="true" type="indicator" multiValued="true" indexed="true"/>

My bad, this is for multi-valued fields; see above for space-delimited token fields. I believe the above should use class="solr.StrField" also.

> "OR" is the default query operator, so unless you've messed with that it should be fine. You need that because if you have multiple fields in the future you want them ORed, as well as the terms. The query would be something like:
>
> q=purchase:("iphone ipad")
>
> So you are applying the items the user has purchased only to the purchase indicator field. As you add cross-cooccurrence actions the fieldType will stay the same and you will add a field for "views":
>
> q=purchase:("iphone ipad") view:("history of items viewed")
>
> And so on.
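As an aside, here is a minimal sketch of issuing that kind of query from SolrJ. The URL, core name, and item IDs are made-up examples, and it assumes a Solr 5.x-era SolrJ client; with the default OR operator the space-separated history tokens are ORed against each indicator field.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class IndicatorQuerySketch {
  public static void main(String[] args) throws Exception {
    // Example core URL -- substitute your own Solr host and core
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/recs");

    // Purchase history against the purchase indicator field, view history
    // against the view field; terms and fields are ORed by default
    SolrQuery q = new SolrQuery("purchase:(iphone ipad) view:(iphone ipad nexus)");
    q.setRows(10); // the top-ranked docs are the recommended items

    QueryResponse resp = solr.query(q);
    for (SolrDocument doc : resp.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
    solr.close();
  }
}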
> You'll also need to index them in Solr as csv files with no header. The output is tab delimited by default, so more correctly a tsv.
>
> This can be set up to use multi-valued fields, but you'd have to store the output of spark-itemsimilarity in Solr or a db. I'd actually recommend this for several reasons, including that it is faster than HDFS, but it requires you to write storage code and customize the Solr config differently.
>
> other answers below:
>
>> On Apr 16, 2015, at 7:35 AM, Pasmanik, Paul <paul.pasma...@danteinc.com> wrote:
>>
>> Thanks, Pat.
>> I have a question regarding the search on the multi-valued field. Once I have indexed the results of spark-itemsimilarity for the purchase field as a multi-valued field in Solr, what kind of search do I perform using the user's purchase history? Is it a phrase search (purchase items separated by spaces, something like purchase:"iphone ipad", with or without a high slop value), an OR query using each purchased item (purchase:("iphone" OR "ipad")), or something totally different?
>>
>> My understanding is that if I have a document with a purchase field that has values 1,2,3,4,5 and another document that has values 3,4,5, and my purchase history has 1,2,4, then the first document should rank higher.
>
> Yes. The longer answer is that Solr with omitNorms="false" will TF-IDF weight terms (in this case individual indicators). So very frequently preferred items will be down-weighted, on the theory that because you and I both like "motherhood and apple pie" it doesn't say much about our similarity of taste; everyone likes those. So actual results will depend on the frequency of item preferences in the corpus. This down-weighting is a good thing, since otherwise the popular things would always be recommended.
>
>> Thanks.
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:p...@occamsmachete.com]
>> Sent: Tuesday, April 07, 2015 7:15 PM
>> To: user@mahout.apache.org; Pasmanik, Paul
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> We are working on a release, which will be 0.10.0, so give it a try if you can. It fixes one problem that you may encounter with an out-of-range index in a vector. You may not see it.
>>
>> 1) The search engine must be able to take one query with multiple fields and apply each field in the query to a separate field in the index. Solr and ES work; not sure about Amazon.
>> 2) Config and experiment seem good.
>> 3) It is good practice to save your interactions in something like a db so they can be replayed to create indicators if needed and to maintain a time range of data. I use a key-value store like the search engine itself or a NoSQL DB. The value is that you can look at the collection as an item catalog and so put metadata in columns or doc fields. This metadata can be used to match context later, so if you are on a "men's clothes" page you may want "people who bought this also bought these" but biased or filtered by the "men's clothes" category (a sketch of this follows below).
>> 4) Tags can be used for CF or for content-based recs, and CF would generally be better. In the case you ask about, the query is items favored, since spark-rowsimilarity will produce similar items (similar in their tags, not in the users who preferred them). So the query is items. Extend this, and text tokens (bag-of-words) can be used to calculate content-based indicators and recs that are personalized, not just "more items like this". But again, CF data would be preferable if available.
>>
>> As with your cluster investment, I'd start small with clear usage-based indicators and build up as needed based on your application data.
>>
>> Let us know how it goes
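A minimal sketch of the context biasing in point 3, reusing the hypothetical SolrJ client above; the "category" field and its values are assumptions standing in for whatever metadata your item catalog carries. A filter query restricts results to the context, while the commented-out edismax boost query would merely bias toward it.

import org.apache.solr.client.solrj.SolrQuery;

public class ContextFilterSketch {
  // Build a rec query restricted to the category of the page the user is on
  static SolrQuery recsForContext(String purchaseHistory, String category) {
    SolrQuery q = new SolrQuery("purchase:(" + purchaseHistory + ")");
    q.addFilterQuery("category:\"" + category + "\""); // hard filter on context
    // To bias rather than filter, drop the fq and boost with edismax:
    // q.set("defType", "edismax");
    // q.set("bq", "category:\"" + category + "\"^2");
    q.setRows(10);
    return q;
  }

  public static void main(String[] args) {
    System.out.println(recsForContext("iphone ipad", "mens-clothes"));
  }
}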
>> On Apr 7, 2015, at 7:01 AM, Pasmanik, Paul <paul.pasma...@danteinc.com> wrote:
>>
>> Thanks, Pat.
>> We are only running an EMR cluster with 1 master and 1 core node right now and were using EMR AMI 3.2.3, which has Hadoop 2.4.0. We are using the default configuration for spark (using the aws script for spark), which I believe sets the number of instances to 2. Spark version 1.1.0h (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md)
>>
>> We are not in production yet as we are experimenting right now.
>>
>> I have a question about the choice of the search engine to do recommendations. I know the Practical Machine Learning book and the mahout docs talk about Solr. Do you see any issues with using Elastic Search or AWS Cloud Search?
>>
>> Also, looking at the content-based indicator example on the intro-cooccurrence-spark mahout page, I see that the spark-rowsimilarity job is used to produce an item-id-to-items matrix, but then it says to use tags associated with purchases in the query for tags, like this:
>>
>> Query:
>>   field: purchase; q: user's-purchase-history
>>   field: view; q: user's view-history
>>   field: tags; q: user's-tags-associated-with-purchases
>>
>> So, we are not providing the actual tags in the tags field query, are we?
>>
>> Thanks
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:p...@occamsmachete.com]
>> Sent: Monday, April 06, 2015 2:33 PM
>> To: user@mahout.apache.org
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> OK, this seems fine. So you used "-ma yarn-client"; I've verified that this works in other cases.
>>
>> BTW, we are nearing a new release. It fixes one cooccurrence problem that you may run into if you are using lots of small files to contain the initial interaction input. This happens often when using Spark Streaming for input.
>>
>> If you want to try the source on github, make sure to compile with -DskipTests since there is a failing test unrelated to the Spark code. Be aware that jar names have changed, if that matters.
>>
>> Can you report the cluster version of Spark and Hadoop as well as how many nodes?
>>
>> Thanks
>>
>> On Apr 6, 2015, at 11:19 AM, Pasmanik, Paul <paul.pasma...@danteinc.com> wrote:
>>
>> Pat, I was not using the spark-submit script. I am using mahout spark-itemsimilarity exactly as specified in http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
>>
>> So, what I did is create a bootstrap action that installs spark and mahout on the EMR cluster. Then I used the AWS Java APIs to create an EMR job step which can call a script (amazon provides script-runner, which can run any script). So I basically create a command (mahout spark-itemsimilarity <parameters>) and pass it to script-runner, which runs it. One of the parameters is -ma, so I pass in yarn-client.
>>
>> We use the AWS Java API to programmatically start the EMR cluster (triggered by a Quartz job) with whatever parameters that job needs. I used the instructions here: https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to install spark as a bootstrap action. I built mahout-1.0 locally and uploaded a package to s3. I also created a bash script to copy that package from s3 to EMR, unpack it, and remove the mahout 0.9 version that is part of the EMR AMI. Then I used another bootstrap action to invoke that script and install mahout. I also had to make changes to the mahout script: added SPARK_HOME=/home/hadoop/spark (this is where I installed spark on EMR), and modified CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR to CLASSPATH=$MAHOUT_CONF_DIR to avoid including the classpath passed in by amazon's script-runner, since it contains a path to the 2.11 version of scala (installed on EMR by Amazon) that conflicts with the spark/mahout 2.10.x version.
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:p...@occamsmachete.com]
>> Sent: Thursday, March 26, 2015 3:49 PM
>> To: user@mahout.apache.org
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity with the spark-submit script? That shouldn't work; the job is a standalone app and does not require, nor is it likely to work with, spark-submit.
>>
>> Were you able to run on Yarn? How?
>>
>> On Jan 29, 2015, at 9:15 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> There are two indices (guava HashBiMaps) that map your IDs into and out of Mahout IDs (HashBiMap<Integer, String>). There is one copy of each (row/user IDs and column/item IDs) per physical machine that all local tasks consult. They are Spark broadcast values. These will grow linearly as the number of items and users grows and as the size of your IDs, treated as strings, grows. The hashmaps have some overhead, but in large collections the main cost is the size of the application IDs stored as strings; Mahout's IDs are ints.
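To make that concrete, here is a toy sketch of such a bidirectional dictionary built with Guava directly; the item IDs are made up, and the real maps are constructed and broadcast internally by Mahout.

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

public class IdDictionarySketch {
  public static void main(String[] args) {
    // Maps Mahout's internal int row/column indices to application item IDs
    BiMap<Integer, String> itemIds = HashBiMap.create();
    itemIds.put(0, "iphone");
    itemIds.put(1, "ipad");

    // Translate in both directions: int -> string when writing indicators,
    // string -> int when reading interaction input
    String appId = itemIds.get(0);                // "iphone"
    int mahoutId = itemIds.inverse().get("ipad"); // 1

    // Memory grows with the number of IDs and, mostly, with the length
    // of the application ID strings held in the map
    System.out.println(appId + " " + mahoutId);
  }
}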
>> On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <paul.pasma...@danteinc.com> wrote:
>>
>> I was able to get spark and mahout installed on an EMR cluster as bootstrap actions and was able to run the spark-itemsimilarity job via an EMR step, with some modifications to the mahout script (defining SPARK_HOME and making sure CLASSPATH is not picked up from the invoking script, which is amazon's script-runner).
>>
>> I was only able to run this job using yarn-client (yarn-master is not able to submit to the resource manager).
>>
>> In yarn-client mode the driver program runs in the client process and submits jobs to executors via the yarn manager, so my question is: how much memory does this driver need? Will the memory requirement vary based on the size of the input to spark-itemsimilarity?
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Pasmanik, Paul [mailto:paul.pasma...@danteinc.com]
>> Sent: Thursday, January 15, 2015 12:46 PM
>> To: user@mahout.apache.org
>> Subject: mahout 1.0 on EMR with spark
>>
>> Has anyone tried running mahout 1.0 on EMR with Spark? I've used the instructions at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get an EMR cluster running spark. I am now able to deploy an EMR cluster with Spark using the AWS Java APIs. EMR allows running a custom script as a bootstrap action, which I can use to install mahout. What I am trying to figure out is whether I would need to build mahout every time I start an EMR cluster, or have pre-built artifacts and develop a script similar to what awslabs uses to install spark.
>>
>> Thanks.
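For reference, a minimal sketch of the EMR-step approach described above, using the AWS SDK for Java v1. The script-runner jar path is Amazon's published one, but the cluster ID, bucket, and wrapper-script name are placeholders; the wrapper script would invoke mahout spark-itemsimilarity with your parameters.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class AddItemSimilarityStep {
  public static void main(String[] args) {
    // Uses the default credential chain
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();

    // script-runner executes an arbitrary script on the master node; the
    // script would run: mahout spark-itemsimilarity <parameters> -ma yarn-client
    HadoopJarStepConfig scriptStep = new HadoopJarStepConfig()
        .withJar("s3://elasticmapreduce/libs/script-runner/script-runner.jar")
        .withArgs("s3://my-bucket/run-itemsimilarity.sh"); // placeholder script

    StepConfig step = new StepConfig()
        .withName("spark-itemsimilarity")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(scriptStep);

    emr.addJobFlowSteps(new AddJobFlowStepsRequest()
        .withJobFlowId("j-XXXXXXXXXXXXX") // placeholder cluster ID
        .withSteps(step));
  }
}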