I have about 1000 documents per XML file. I am not really doing anything with the data other than putting the XML tags around it, so the data is essentially fine, with the exception of a few documents that are causing the errors.
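Since only a few of the roughly 1000 documents per file are malformed, one way to find them before posting is to run the generated file through an XML parser locally. The following is a minimal sketch, assuming Python is available on the machine that generates the files. The function name `first_xml_error` is hypothetical and is not part of Solr or post.jar.

```python
import xml.parsers.expat


def first_xml_error(data):
    """Return (line, column, message) for the first well-formedness
    error in the given bytes, or None if the input parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return err.lineno, err.offset, message
    return None


if __name__ == "__main__":
    # A bare '&' followed by text, as in the '&What' case, is not well-formed.
    sample = b"<add><doc><field name='title'>Say &What now</field></doc></add>"
    print(first_xml_error(sample))
```

The reported line and column can then be matched against the generated file to see which document needs fixing.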
Let's say document #47 in the XML file has a problem. Is the whole file skipped when using post.jar?

I will add the CDATA to my XML generator. Sometimes the data will come in as a string of pretty funky looking characters. I am assuming this is UTF-8. Is there any specialized data type I need to declare for this data?

One other thing I noticed is that sometimes I may get data in a binary compressed format, like an image or something. Obviously I am not looking to index it, but is there a data type this can be stored as in Solr so I can retrieve and render it easily?

On Mar 28, 2011, at 1:38 PM, Markus Jelsma wrote:

> Also, don't forget to encode entities or wrap them in CDATA.
>
>> Jan,
>>
>> thank you for such a quick reply. I have a feed coming in that I convert to
>> an <add><doc></doc><doc></doc> Here is the type for text including index
>> and query with the changes suggested.
>>
>>
>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>             catenateAll="0"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldtype>
>>
>>
>> Here is the snippet of the file I generate.
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <add>
>> <doc>
>>   <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
>>   <field name="title">E X I T</field>
>>   <field name="authorName">uswautis (Hasanah Uswa)</field>
>>   <field name="authorEmail"></field>
>>   <field name="authorLinkMimeType"></field>
>>   <field name="authorLink">http://twitter.com/uswautis</field>
>>   <field name="lang">U</field>
>>   <field name="publishDate">2011-03-27T13:21:52Z</field>
>>   <field name="aquiDate">2011-03-27T13:22:13Z</field>
>>   <field name="source"></field>
>>   <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
>>   <field name="feedContentMimeType">text/html</field>
>>   <field name="feedContentEncoding"></field>
>>   <field name="feedContent">null</field>
>>   <field name="inboundLinks">0</field>
>>   <field name="publisherType">MICROBLOG</field>
>>   <field name="postTitle">E X I T</field>
>>   <field name="postBodyMimeType">text/html</field>
>>   <field name="postBodyEncoding">zlib</field>
>>   <field name="postBody">mime_type: "text/html"
>> data: ""
>> </field>
>>   <field name="tags">[]</field>
>> </doc>
>>
>> <doc>
>>   <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>>   <field name="title">I want the sweater i saw in mango sooooo bad.</field>
>>   <field name="authorName">imsuperangelica (angelica marie)</field>
>>   <field name="authorEmail"></field>
>>   <field name="authorLinkMimeType"></field>
>>   <field name="authorLink">http://twitter.com/imsuperangelica</field>
>>   <field name="lang">en</field>
>>   <field name="publishDate">2011-03-27T13:21:52Z</field>
>>   <field name="aquiDate">2011-03-27T13:22:13Z</field>
>>   <field name="source"></field>
>>   <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>>   <field name="feedContentMimeType">text/html</field>
>>   <field name="feedContentEncoding"></field>
>>   <field name="feedContent">null</field>
>>   <field name="inboundLinks">0</field>
>>   <field name="publisherType">MICROBLOG</field>
>>   <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
>>   <field name="postBodyMimeType">text/html</field>
>>   <field name="postBodyEncoding">zlib</field>
>>   <field name="postBody">mime_type: "text/html"
>> data: ""
>> </field>
>>   <field name="tags">[]</field>
>> </doc>
>>
>> </add>
>>
>> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
>>> Hi,
>>>
>>> I assume you try to post HTML files from post.jar, and use
>>> HTMLStripCharFilter to sanitize the HTML.
>>>
>>> But you refer to "my file" as if you have multiple docs in one file? XML
>>> or HTML? Multiple files? To what UpdateRequestHandler are you posting?
>>> /update/xml or /update/extract? For us to understand what you're trying
>>> to achieve, please describe your project in more detail.
>>>
>>>
>>> To give some concrete feedback too: First off, your analyzer for "text"
>>> is wrong. All charFilters need to be before the tokenizer. You also
>>> lack an analyzer with type="query". If I were you I'd try the simplest
>>> case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
>>> and Stemmer - just do the most basic stuff you can and go from there.
>>>
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>>
>>> On 28 March 2011, at 18.52, Charles Wardell wrote:
>>>> Hi Everyone,
>>>>
>>>> I set up a server and began to index my data. I have two questions I am
>>>> hoping someone can help me with. Many of my files seem to index without
>>>> any problems. Others, I get a host of different errors. I am indexing
>>>> primarily web based content and have identified my text field as
>>>> follows:
>>>>
>>>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>>>
>>>>   <analyzer type="index">
>>>>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>>>>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>             words="stopwords.txt"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             generateWordParts="1" generateNumberParts="1"
>>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>
>>>>   </analyzer>
>>>>
>>>> </fieldtype>
>>>>
>>>> q1) Errors while indexing.
>>>>
>>>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
>>>> status="0"></result>' does not contain '<int name="status">0</int>'
>>>>
>>>> * SEVERE: Error processing "legacy" update
>>>> command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
>>>> character ' ' (code 32) in content after '<' (malformed start
>>>> element?). at [row,col {unknown-source}]: [1591,90] at
>>>> com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
>>>>
>>>> * Although I can't find the actual error, I recall Solr giving me an
>>>> error when it came across a string &What - The error was something like
>>>> expecting semicolon after "What"
>>>>
>>>>
>>>> q2) If my file has 1000 documents and I submit it with post.jar, if it
>>>> comes across any of the above errors, will it break the processing of
>>>> the whole file, or just the document with the error?
>>>>
>>>>
>>>> Thanks in advance.
>>>> Your help is very much appreciated.
>>>>
>>>> Charlie
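
The two parser errors listed under q1 (an unexpected space after '<', and a missing semicolon after "&What") are the kind an XML parser reports when '<' and '&' appear unescaped in field text. Below is a minimal sketch of escaping at generation time, assuming the generator can be written in Python. The helper names `clean_text`, `field_xml` and `add_xml` are hypothetical and not part of Solr. It escapes entities rather than using CDATA, since a CDATA section still breaks if the text contains "]]>". It also drops characters that XML 1.0 forbids, which is one possible source of "funky" input that fails to parse.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Anything outside the XML 1.0 "Char" range (mostly control characters).
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(value):
    """Remove characters that may not appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", value)


def field_xml(name, value):
    """Build one <field> element with '&', '<' and '>' escaped."""
    return "<field name=%s>%s</field>" % (
        quoteattr(name),
        escape(clean_text(value)),
    )


def add_xml(docs):
    """Build an <add> message from a list of {field name: text} dicts."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<add>"]
    for doc in docs:
        parts.append("<doc>")
        parts.extend(field_xml(name, value) for name, value in doc.items())
        parts.append("</doc>")
    parts.append("</add>")
    return "\n".join(parts)


if __name__ == "__main__":
    message = add_xml([{"title": "Say &What < now", "lang": "en"}])
    # Write the bytes as UTF-8 so they match the encoding declaration.
    with open("docs.xml", "wb") as handle:
        handle.write(message.encode("utf-8"))
```

The output file name `docs.xml` is only an example. The resulting file can be posted the same way as before.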