I have Solr version 1.4.1 and Mahout version 0.4.
I have created an index from the XML files in the exampledocs directory.
Indexing works fine, and queries against that index also work. But when I try
to create Mahout vectors from the Solr index, it reports "Wrote: 0 vectors".
I've set termVectors="true" in the schema.xml file for most of the fields.
Are there any configuration settings I'm missing?
Right now I'm posting only two files, solr.xml and payload.xml, from the
example/solr/conf directory as a trial.
Part of my schema.xml looks like this:
<field name="id" type="string" indexed="true" stored="true" required="true"
termVectors="true" />
<field name="sku" type="textTight" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>
<field name="name" type="textgen" indexed="true" stored="true"
termVectors="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true"
stored="false" termVectors="true"/>
<field name="manu" type="textgen" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" termVectors="true" />
<field name="features" type="text" indexed="true" stored="true"
multiValued="true"/>
<field name="includes" type="text" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true" />
<field name="weight" type="float" indexed="true" stored="true"
termVectors="true"/>
<field name="price" type="float" indexed="true" stored="true"
termVectors="true"/>
<field name="popularity" type="int" indexed="true" stored="true"
termVectors="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"
termVectors="true"/>
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them.
-->
<field name="title" type="text" indexed="true" stored="true"
multiValued="true" termVectors="true"/>
<field name="subject" type="text" indexed="true" stored="true"
termVectors="true"/>
<field name="description" type="text" indexed="true" stored="true"
termVectors="true"/>
<field name="comments" type="text" indexed="true" stored="true"
termVectors="true"/>
<field name="author" type="textgen" indexed="true" stored="true"
termVectors="true"/>
<field name="keywords" type="textgen" indexed="true" stored="true"
termVectors="true"/>
<field name="category" type="textgen" indexed="true" stored="true"
termVectors="true"/>
<field name="content_type" type="string" indexed="true" stored="true"
multiValued="true" termVectors="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"
termVectors="true"/>
<field name="links" type="string" indexed="true" stored="true"
multiValued="true" termVectors="true"/>
<!-- catchall field, containing all other searchable text fields (implemented
via copyField further on in this schema) -->
<field name="text" type="text" indexed="true" stored="false"
multiValued="true"/>
<!-- catchall text field that indexes tokens both normally and in reverse
for efficient leading wildcard queries. -->
<field name="text_rev" type="text_rev" indexed="true" stored="false"
multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or group
results by manufacturer. copied from "manu" via copyField -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<!-- Uncommenting the following will create a "timestamp" field using
a default value of "NOW" to indicate when each document was indexed.
-->
<!--
<field name="timestamp" type="date" indexed="true" stored="true"
default="NOW" multiValued="false"/>
-->
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i,
z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"
termVectors="true"/>
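As a sanity check on the schema itself, a small script along these lines could list which fields have termVectors enabled and whether the name passed to --field is defined at all. This is only a sketch: the function name fields_with_term_vectors and the SAMPLE string (a few fields copied from the fragment above, wrapped in a made-up <fields> root) are illustrative assumptions, and it inspects the schema text only, not the Lucene index.

```python
import xml.etree.ElementTree as ET


def fields_with_term_vectors(schema_xml):
    """Split <field>/<dynamicField> names by whether termVectors="true" is set."""
    root = ET.fromstring(schema_xml)
    enabled, not_set = [], []
    for el in root.iter():
        if el.tag not in ("field", "dynamicField"):
            continue
        name = el.get("name")
        if el.get("termVectors") == "true":
            enabled.append(name)
        else:
            not_set.append(name)
    return enabled, not_set


# Hypothetical excerpt: a few fields from the schema above, wrapped in a
# root element so the fragment parses as a single XML document.
SAMPLE = """<fields>
  <field name="id" type="string" indexed="true" stored="true"
         required="true" termVectors="true"/>
  <field name="title" type="text" indexed="true" stored="true"
         multiValued="true" termVectors="true"/>
  <field name="features" type="text" indexed="true" stored="true"
         multiValued="true"/>
  <field name="text" type="text" indexed="true" stored="false"
         multiValued="true"/>
</fields>"""

enabled, not_set = fields_with_term_vectors(SAMPLE)
print("termVectors enabled:", enabled)
print("termVectors not set:", not_set)

# The field name given to --field in the command below.
wanted = "title-clustering"
print(wanted, "defined in schema:", wanted in enabled + not_set)
```

To run it against the real file, the contents of schema.xml can be read and passed in place of SAMPLE.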
This is the output I'm getting:
hadoop@dahlia:/home/sai/project/mahout-distribution-0.4$ bin/mahout lucene.vector \
    --dir /home/sai/project/apache-solr-1.4.1/example/solr/data/index/ \
    --output /home/sai/project/output/part-out.vec \
    --field title-clustering --idField id \
    --dictOut /home/sai/project/output/dict.out --norm 2
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
11/08/11 14:55:15 INFO lucene.Driver: Output File: /home/sai/project/output/part-out.vec
11/08/11 14:55:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/08/11 14:55:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/08/11 14:55:16 INFO compress.CodecPool: Got brand-new compressor
11/08/11 14:55:16 INFO lucene.Driver: Wrote: 0 vectors
11/08/11 14:55:16 INFO lucene.Driver: Dictionary Output file: /home/sai/project/output/dict.out
11/08/11 14:55:16 INFO driver.MahoutDriver: Program took 1078 ms
sai