Problem creating mahout vectors from solr index

sai_ratnaparkhi Sat, 13 Aug 2011 11:01:25 -0700

I have Solr version 1.4.1 and Mahout version 0.4.
I created an index from the XML files in the exampledocs directory. Indexing 
works fine, and queries against the index also work. But when I try to create 
Mahout vectors from the Solr index, it reports "Wrote: 0 vectors".
I've set termVectors="true" in the schema.xml file for all the fields.
Is there any configuration setting I'm missing?
Right now I'm posting only two files, solr.xml and payload.xml, from the 
example/solr/conf directory for trial.


A portion of my schema.xml looks like this:
   <field name="id" type="string" indexed="true" stored="true" required="true" termVectors="true" />
   <field name="sku" type="textTight" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
   <field name="name" type="textgen" indexed="true" stored="true" termVectors="true"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false" termVectors="true"/>
   <field name="manu" type="textgen" indexed="true" stored="true" omitNorms="true" termVectors="true"/>
   <field name="cat" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
   <field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

   <field name="weight" type="float" indexed="true" stored="true" termVectors="true"/>
   <field name="price"  type="float" indexed="true" stored="true" termVectors="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" termVectors="true"/>
   <field name="inStock" type="boolean" indexed="true" stored="true" termVectors="true"/>

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
   <field name="subject" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="description" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="comments" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="author" type="textgen" indexed="true" stored="true" termVectors="true"/>
   <field name="keywords" type="textgen" indexed="true" stored="true" termVectors="true"/>
   <field name="category" type="textgen" indexed="true" stored="true" termVectors="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true" termVectors="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>

   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_rev" indexed="true" stored="false" multiValued="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>

   <field name="payloads" type="payloads" indexed="true" stored="true"/>

   <!-- Uncommenting the following will create a "timestamp" field using
        a default value of "NOW" to indicate when each document was indexed.
     -->
   <!--
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
     -->

   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
        will be used if the name matches any of the patterns.
        RESTRICTION: the glob-like pattern in the name attribute must have
        a "*" only at the start or the end.
        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
        Longer patterns will be matched first.  if equal size patterns
        both match, the first appearing in the schema will be used.  -->
   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true" termVectors="true"/>

This is the output I'm getting:
hadoop@dahlia:/home/sai/project/mahout-distribution-0.4$ bin/mahout lucene.vector \
    --dir /home/sai/project/apache-solr-1.4.1/example/solr/data/index/ \
    --output /home/sai/project/output/part-out.vec \
    --field title-clustering --idField id \
    --dictOut /home/sai/project/output/dict.out --norm 2
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
11/08/11 14:55:15 INFO lucene.Driver: Output File: /home/sai/project/output/part-out.vec
11/08/11 14:55:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/08/11 14:55:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/08/11 14:55:16 INFO compress.CodecPool: Got brand-new compressor
11/08/11 14:55:16 INFO lucene.Driver: Wrote: 0 vectors
11/08/11 14:55:16 INFO lucene.Driver: Dictionary Output file: /home/sai/project/output/dict.out
11/08/11 14:55:16 INFO driver.MahoutDriver: Program took 1078 ms
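
One thing that may be worth ruling out is whether the name passed to --field is 
actually declared in schema.xml with termVectors="true". The command above 
passes title-clustering, while the schema excerpt shows a field named title; 
title-clustering may of course be defined in a part of the schema not shown. 
Below is a minimal Python sketch of that check. The SCHEMA string is a 
trimmed-down, hypothetical fragment modeled on the excerpt, and the helper 
names term_vector_fields and check_field are made up for illustration. It only 
looks at explicit <field> declarations and ignores dynamicField patterns.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema fragment modeled on the excerpt above.
SCHEMA = """
<schema>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" termVectors="true"/>
    <field name="title" type="text" indexed="true" stored="true" multiValued="true" termVectors="true"/>
    <field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
</schema>
"""


def term_vector_fields(schema_xml):
    """Return the names of <field> elements declared with termVectors="true"."""
    root = ET.fromstring(schema_xml)
    return {
        field.get("name")
        for field in root.iter("field")
        # A missing termVectors attribute is treated as "false" here.
        if field.get("termVectors", "false").lower() == "true"
    }


def check_field(schema_xml, field_name):
    """True if field_name is an explicitly declared field with term vectors on."""
    return field_name in term_vector_fields(schema_xml)


if __name__ == "__main__":
    # Report each candidate --field value against the schema fragment.
    for name in ("title", "title-clustering", "features"):
        print(name, check_field(SCHEMA, name))
```

To run it against a real schema, replace SCHEMA with the contents of the 
schema.xml file being used.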


sai
