Also, don't forget to escape special characters such as & and < in your field values as entities (&amp; and &lt;), or wrap the values in CDATA sections.
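
As a rough sketch of both options, here is what a field value containing a bare ampersand and a bare less-than sign could look like. The field name postBody is borrowed from your snippet, but the text inside it is made up for illustration. A raw "&What" is likely what triggers the "expecting semicolon" error, and a raw "<" followed by a space would explain the "malformed start element" one.

```xml
<add>
  <doc>
    <!-- Option 1: escape the special characters as entities -->
    <field name="postBody">Q&amp;A: &amp;What is 1 &lt; 2?</field>
  </doc>
  <doc>
    <!-- Option 2: leave the text raw inside a CDATA section -->
    <field name="postBody"><![CDATA[Q&A: &What is 1 < 2?]]></field>
  </doc>
</add>
```

Keep in mind that a CDATA section cannot itself contain the sequence ]]>, so if your feed content might include it, escaping with entities is the safer choice.
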
> Jan,
>
> thank you for such a quick reply. I have a feed coming in that I convert to
> an <add><doc></doc><doc></doc>. Here is the type for text including index
> and query with the changes suggested.
>
>
> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldtype>
>
>
> Here is the snippet of the file I generate.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
>   <doc>
>     <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
>     <field name="title">E X I T</field>
>     <field name="authorName">uswautis (Hasanah Uswa)</field>
>     <field name="authorEmail"></field>
>     <field name="authorLinkMimeType"></field>
>     <field name="authorLink">http://twitter.com/uswautis</field>
>     <field name="lang">U</field>
>     <field name="publishDate">2011-03-27T13:21:52Z</field>
>     <field name="aquiDate">2011-03-27T13:22:13Z</field>
>     <field name="source"></field>
>     <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
>     <field name="feedContentMimeType">text/html</field>
>     <field name="feedContentEncoding"></field>
>     <field name="feedContent">null</field>
>     <field name="inboundLinks">0</field>
>     <field name="publisherType">MICROBLOG</field>
>     <field name="postTitle">E X I T</field>
>     <field name="postBodyMimeType">text/html</field>
>     <field name="postBodyEncoding">zlib</field>
>     <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
>     <field name="tags">[]</field>
>   </doc>
>
>   <doc>
>     <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>     <field name="title">I want the sweater i saw in mango sooooo bad.</field>
>     <field name="authorName">imsuperangelica (angelica marie)</field>
>     <field name="authorEmail"></field>
>     <field name="authorLinkMimeType"></field>
>     <field name="authorLink">http://twitter.com/imsuperangelica</field>
>     <field name="lang">en</field>
>     <field name="publishDate">2011-03-27T13:21:52Z</field>
>     <field name="aquiDate">2011-03-27T13:22:13Z</field>
>     <field name="source"></field>
>     <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>     <field name="feedContentMimeType">text/html</field>
>     <field name="feedContentEncoding"></field>
>     <field name="feedContent">null</field>
>     <field name="inboundLinks">0</field>
>     <field name="publisherType">MICROBLOG</field>
>     <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
>     <field name="postBodyMimeType">text/html</field>
>     <field name="postBodyEncoding">zlib</field>
>     <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
>     <field name="tags">[]</field>
>   </doc>
>
> </add>
>
> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> > Hi,
> >
> > I assume you try to post HTML files from post.jar, and use
> > HTMLStripCharFilter to sanitize the HTML.
> >
> > But you refer to "my file" as if you have multiple docs in one file? XML
> > or HTML? Multiple files? To what UpdateRequestHandler are you posting?
> > /update/xml or /update/extract ? For us to understand what you're trying
> > to achieve, please describe your project in more detail.
> >
> >
> > To give some concrete feedback too: First off, your analyzer for "text"
> > is wrong. All charFilter's need to be before the tokenizer. You also
> > lack an analyzer with type="query". If I were you I'd try the simplest
> > case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
> > and Stemmer - just do the most basic stuff you can and go from there.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > On 28. mars 2011, at 18.52, Charles Wardell wrote:
> >> Hi Everyone,
> >>
> >> I setup a server and began to index my data. I have two questions I am
> >> hoping someone can help me with. Many of my files seem to index without
> >> any problems. Others, I get a host of different errors.
> >> I am indexing primarily web based content and have identified my text
> >> field as follows:
> >>
> >> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
> >>
> >>   <analyzer type="index">
> >>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
> >>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>             words="stopwords.txt"/>
> >>     <filter class="solr.WordDelimiterFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1"
> >>             catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>
> >>   </analyzer>
> >>
> >> </fieldtype>
> >>
> >> q1) Errors while indexing.
> >>
> >> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
> >>   status="0"></result>' does not contain '<int name="status">0</int>'
> >>
> >> * SEVERE: Error processing "legacy" update
> >>   command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
> >>   character ' ' (code 32) in content after '<' (malformed start
> >>   element?). at [row,col {unknown-source}]: [1591,90] at
> >>   com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
> >>
> >> * Although I can't find the actual error, I recall solr giving me an
> >>   error when it came across a string &What - The error was something like
> >>   expecting semicolon after "What"
> >>
> >>
> >> q2) If my file has 1000 documents and I submit it with post.jar, if it
> >> comes across any of the above errors, will it break the processing of
> >> the whole file, or just the document with the error?
> >>
> >>
> >> Thanks in advance.
> >> Your help is very much appreciated.
> >>
> >> Charlie