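Jan, thank you for such a quick reply. I have a feed coming in that I convert into a single XML file of the form <add><doc></doc><doc></doc></add>, so there are multiple docs in one file. Here is the field type for "text", including both the index and query analyzers, with the changes you suggested.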

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charfilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldtype>

Here is a snippet of the file I generate:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
    <field name="title">E X I T</field>
    <field name="authorName">uswautis (Hasanah Uswa)</field>
    <field name="authorEmail"></field>
    <field name="authorLinkMimeType"></field>
    <field name="authorLink">http://twitter.com/uswautis</field>
    <field name="lang">U</field>
    <field name="publishDate">2011-03-27T13:21:52Z</field>
    <field name="aquiDate">2011-03-27T13:22:13Z</field>
    <field name="source"></field>
    <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
    <field name="feedContentMimeType">text/html</field>
    <field name="feedContentEncoding"></field>
    <field name="feedContent">null</field>
    <field name="inboundLinks">0</field>
    <field name="publisherType">MICROBLOG</field>
    <field name="postTitle">E X I T</field>
    <field name="postBodyMimeType">text/html</field>
    <field name="postBodyEncoding">zlib</field>
    <field name="postBody">mime_type: "text/html" data: "" </field>
    <field name="tags">[]</field>
  </doc>
  <doc>
    <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
    <field name="title">I want the sweater i saw in mango sooooo bad.</field>
    <field name="authorName">imsuperangelica (angelica marie)</field>
    <field name="authorEmail"></field>
    <field name="authorLinkMimeType"></field>
    <field name="authorLink">http://twitter.com/imsuperangelica</field>
    <field name="lang">en</field>
    <field name="publishDate">2011-03-27T13:21:52Z</field>
    <field name="aquiDate">2011-03-27T13:22:13Z</field>
    <field name="source"></field>
    <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
    <field name="feedContentMimeType">text/html</field>
    <field name="feedContentEncoding"></field>
    <field name="feedContent">null</field>
    <field name="inboundLinks">0</field>
    <field name="publisherType">MICROBLOG</field>
    <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
    <field name="postBodyMimeType">text/html</field>
    <field name="postBodyEncoding">zlib</field>
    <field name="postBody">mime_type: "text/html" data: "" </field>
    <field name="tags">[]</field>
  </doc>
</add>

On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:

> Hi,
>
> I assume you try to post HTML files from post.jar, and use
> HTMLStripCharFilter to sanitize the HTML.
>
> But you refer to "my file" as if you have multiple docs in one file? XML or
> HTML? Multiple files?
> To what UpdateRequestHandler are you posting? /update/xml or /update/extract ?
> For us to understand what you're trying to achieve, please describe your
> project in more detail.
>
>
> To give some concrete feedback too: First off, your analyzer for "text" is
> wrong. All charFilter's need to be before the tokenizer. You also lack an
> analyzer with type="query". If I were you I'd try the simplest case first,
> get rid of mappingCharFilter, StopFilter, WordDelimFilter and Stemmer - just
> do the most basic stuff you can and go from there.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 28. mars 2011, at 18.52, Charles Wardell wrote:
>
>> Hi Everyone,
>>
>> I setup a server and began to index my data. I have two questions I am
>> hoping someone can help me with. Many of my files seem to index without any
>> problems. Others, I get a host of different errors.
>> I am indexing primarily web based content and have identified my text
>> field as follows:
>>
>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <charFilter class="solr.MappingCharFilterFactory"
>>                 mapping="mapping.txt"/>
>>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>             catenateNumbers="1" catenateAll="0"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory"
>>             protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldtype>
>>
>>
>> q1) Errors while indexing.
>>
>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
>> status="0"></result>' does not contain '<int name="status">0</int>'
>>
>> * SEVERE: Error processing "legacy" update
>> command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character
>> ' ' (code 32) in content after '<' (malformed start element?). at [row,col
>> {unknown-source}]: [1591,90] at
>> com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
>>
>> * Although I can't find the actual error, I recall solr giving me an error
>> when it came across a string &What - The error was something like expecting
>> semicolon after "What"
>>
>>
>> q2) If my file has 1000 documents and I submit it with post.jar, if it comes
>> across any of the above errors, will it break the processing of the whole
>> file, or just the document with the error?
>>
>>
>> Thanks in advance.
>> Your help is very much appreciated.
>>
>> Charlie
>>
>
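
A note on the q1 errors quoted above: both the "expecting semicolon after What" message and the "Unexpected character ' ' (code 32) in content after '<'" exception look like what an XML parser reports when field text contains a raw & or < character. That is an assumption, since the failing row of the file has not been inspected. Below is a minimal sketch of how a converter could build the <add> file with Python's standard xml.etree.ElementTree so that such characters are escaped on output. The function name build_add_xml and the sample document are hypothetical and are not part of the actual feed converter.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize an iterable of {field name: value} dicts as a Solr <add> message.

    Hypothetical helper for illustration only. ElementTree escapes
    '&', '<' and '>' in element text when it serializes the tree.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            # Treat a missing value as an empty field.
            field.text = "" if value is None else str(value)
    return ET.tostring(add, encoding="unicode")


# Made-up sample containing the kind of text that breaks a hand-built file.
sample = [{
    "guid": "http://example.com/1",
    "title": "Fish & Chips <b>now</b> &What",
}]

xml_text = build_add_xml(sample)

# Re-parsing raises xml.etree.ElementTree.ParseError if the output is malformed.
ET.fromstring(xml_text)
print(xml_text)
```

Building the tree and letting the library serialize it avoids hand-written escaping. If the converter has to keep writing strings directly, xml.sax.saxutils.escape applied to each field value is the standard-library alternative.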