Re: Problem creating mahout vectors from solr index

Lance Norskog Mon, 15 Aug 2011 22:05:42 -0700

Learn the basics of a Lucene index. The Luke program
(http://code.google.com/p/luke/) lets you examine an index in detail and
learn its various parts. Don't bother learning them deeply; just get
the road map in your head. Take one of the indexes made by the Mahout
jobs and examine it; it should be very simple. Then examine your Solr
index. You may have to use a back-rev Solr to get an index that matches
the Lucene version in the Luke download.


Next, understand that Solr has a bunch of field types that are not in
Lucene, and which Luke will not understand. If you want to use
numbers, use the 'pint'/'pdouble' types. These match the Lucene
integer & double types.
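
As a rough sketch of what that looks like in schema.xml: the stock
Solr 1.4 example schema declares the plain numeric types roughly as
below. The two field lines are only an illustration using names from
your schema (switching price to a double is my assumption, not a
recommendation); check the fieldType declarations against your own copy.

```xml
<!-- Plain (non-trie, non-sortable) numeric types, as declared in the
     Solr 1.4 example schema; verify against your own schema.xml -->
<fieldType name="pint"    class="solr.IntField"    omitNorms="true"/>
<fieldType name="pdouble" class="solr.DoubleField" omitNorms="true"/>

<!-- Illustration only: numeric fields switched to the plain types -->
<field name="popularity" type="pint"    indexed="true" stored="true"/>
<field name="price"      type="pdouble" indexed="true" stored="true"/>
```

Changing a field's type means reindexing the documents before the
index reflects it.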

If you plan to do Solr->Mahout a lot, I would bolt the Embedded Solr
library in as a data input source, so Mahout reads documents through
Solr rather than from the raw Lucene index.
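
On Grant's question below: the lucene.vector command passes
--field title-clustering, but no field with that name appears in the
portion of schema.xml you quoted. If it really is not defined anywhere,
a definition along these lines might be what is missing. This is only a
sketch: the copyField from title is my assumption about where the text
should come from, and the 'text' type is the one your schema already
uses.

```xml
<!-- Hypothetical: a field matching the name passed to --field,
     with term vectors stored so they can be read back from the index -->
<field name="title-clustering" type="text" indexed="true" stored="true"
       multiValued="true" termVectors="true"/>

<!-- Assumption: populate it from the existing title field -->
<copyField source="title" dest="title-clustering"/>
```

After a schema change like this the documents have to be reindexed,
and they need a value in the source field, before any term vectors can
show up in the index.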

On Mon, Aug 15, 2011 at 4:04 AM, Grant Ingersoll <[email protected]> wrote:
> Where is the "title-clustering" field defined?
>
> On Aug 13, 2011, at 1:59 PM, [email protected] wrote:
>
>> I have Solr version 1.4.1 and Mahout version 0.4.
>> I have created an index from the XML files given in the exampledocs
>> directory. It is working fine, and queries against the index also work.
>> But when I try to create Mahout vectors from the Solr index, it reports
>> "Wrote: 0 vectors".
>> I've set termVectors="true" in the schema.xml file for all the fields.
>> Are there any configuration settings I'm missing?
>> Right now I'm posting only two files, solr.xml and payload.xml, from the
>> example/solr/conf directory for trial.
>>
>> A portion of my schema.xml looks like this:
>>   <field name="id" type="string" indexed="true" stored="true" 
>> required="true" termVectors="true" />
>>
>>   <field name="sku" type="textTight" indexed="true" stored="true" 
>> omitNorms="true" termVectors="true"/>
>>
>>   <field name="name" type="textgen" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" 
>> stored="false" termVectors="true"/>
>>
>>   <field name="manu" type="textgen" indexed="true" stored="true" 
>> omitNorms="true" termVectors="true"/>
>>
>>   <field name="cat" type="text_ws" indexed="true" stored="true" 
>> multiValued="true" omitNorms="true" termVectors="true" />
>>
>>   <field name="features" type="text" indexed="true" stored="true" 
>> multiValued="true"/>
>>
>>   <field name="includes" type="text" indexed="true" stored="true" 
>> termVectors="true" termPositions="true" termOffsets="true" />
>>
>>
>>
>>   <field name="weight" type="float" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="price"  type="float" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="popularity" type="int" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="inStock" type="boolean" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>
>>
>>
>>
>>   <!-- Common metadata fields, named specifically to match up with
>>
>>     SolrCell metadata when parsing rich documents such as Word, PDF.
>>
>>     Some fields are multiValued only because Tika currently may return
>>
>>     multiple values for them.
>>
>>   -->
>>
>>   <field name="title" type="text" indexed="true" stored="true" 
>> multiValued="true" termVectors="true"/>
>>
>>   <field name="subject" type="text" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="description" type="text" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="comments" type="text" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="author" type="textgen" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="keywords" type="textgen" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="category" type="textgen" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="content_type" type="string" indexed="true" stored="true" 
>> multiValued="true" termVectors="true"/>
>>
>>   <field name="last_modified" type="date" indexed="true" stored="true" 
>> termVectors="true"/>
>>
>>   <field name="links" type="string" indexed="true" stored="true" 
>> multiValued="true" termVectors="true"/>
>>
>>
>>
>>
>>
>>   <!-- catchall field, containing all other searchable text fields 
>> (implemented
>>
>>        via copyField further on in this schema  -->
>>
>>   <field name="text" type="text" indexed="true" stored="false" 
>> multiValued="true"/>
>>
>>
>>
>>   <!-- catchall text field that indexes tokens both normally and in reverse 
>> for efficient
>>
>>        leading wildcard queries. -->
>>
>>   <field name="text_rev" type="text_rev" indexed="true" stored="false" 
>> multiValued="true"/>
>>
>>
>>
>>   <!-- non-tokenized version of manufacturer to make it easier to sort or 
>> group
>>
>>        results by manufacturer.  copied from "manu" via copyField -->
>>
>>   <field name="manu_exact" type="string" indexed="true" stored="false"/>
>>
>>
>>
>>   <field name="payloads" type="payloads" indexed="true" stored="true"/>
>>
>>
>>
>>   <!-- Uncommenting the following will create a "timestamp" field using
>>
>>        a default value of "NOW" to indicate when each document was indexed.
>>
>>     -->
>>
>>   <!--
>>
>>   <field name="timestamp" type="date" indexed="true" stored="true" 
>> default="NOW" multiValued="false"/>
>>
>>     -->
>>
>>
>>
>>
>>
>>   <!-- Dynamic field definitions.  If a field name is not found, 
>> dynamicFields
>>
>>        will be used if the name matches any of the patterns.
>>
>>        RESTRICTION: the glob-like pattern in the name attribute must have
>>
>>        a "*" only at the start or the end.
>>
>>        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, 
>> z_i)
>>
>>        Longer patterns will be matched first.  if equal size patterns
>>
>>        both match, the first appearing in the schema will be used.  -->
>>
>>   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
>>
>>   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true" 
>> termVectors="true"/>
>>
>> This is the output I'm getting:
>> hadoop@dahlia:/home/sai/project/mahout-distribution-0.4$ bin/mahout 
>> lucene.vector --dir 
>> /home/sai/project/apache-solr-1.4.1/example/solr/data/index/ --output 
>> /home/sai/project/output/part-out.vec --field title-clustering --idField id 
>> --dictOut /home/sai/project/output/dict.out --norm 2
>> Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
>> HADOOP_CONF_DIR=/usr/local/hadoop/conf
>> 11/08/11 14:55:15 INFO lucene.Driver: Output File: 
>> /home/sai/project/output/part-out.vec
>> 11/08/11 14:55:16 INFO util.NativeCodeLoader: Loaded the native-hadoop 
>> library
>> 11/08/11 14:55:16 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>> native-zlib library
>> 11/08/11 14:55:16 INFO compress.CodecPool: Got brand-new compressor
>> 11/08/11 14:55:16 INFO lucene.Driver: Wrote: 0 vectors
>> 11/08/11 14:55:16 INFO lucene.Driver: Dictionary Output file: 
>> /home/sai/project/output/dict.out
>> 11/08/11 14:55:16 INFO driver.MahoutDriver: Program took 1078 ms
>>
>>
>> sai
>
> --------------------------
> Grant Ingersoll
>
>
>
>



-- 
Lance Norskog
[email protected]
