> I have about 1000 documents per xml file. I am not really doing anything
> with the data other than putting the xml tags around it. So essentially
> the data is okay with the exception of a few documents that are causing
> the errors.
>
> Let's say document # 47 in the xml file has a problem, is the whole file
> skipped when using post.jar? I will add the CDATA to my xml generator.

I am not sure, actually. I have never tried it, but I think it's thrown away.

> Sometimes the data will come in as a string of pretty funky looking
> characters. I am assuming this is UTF-8. Is there any specialized data
> type I need to declare for this data?

Well, all data needs to be UTF-8 encoded. Anyway, wrongly encoded text data is
just indexed as is and won't throw an error. Except for entities, of course.

> One other thing I noticed is that sometimes I may get data in binary
> compressed format. Like an image or something. Obviously I am not looking
> to index it, but is there a data type this can be stored as in Solr so I
> can retrieve and render easily?

Yes, use the binary field type [1]. You have to base64 encode the data.

[1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/BinaryField.html

> On Mar 28, 2011, at 1:38 PM, Markus Jelsma wrote:
> > Also, don't forget to encode entities or wrap them in CDATA.
> >
> >> Jan,
> >>
> >> thank you for such a quick reply. I have a feed coming in that I convert
> >> to an <add><doc></doc><doc></doc> Here is the type for text including
> >> index and query with the changes suggested.
> >>
> >> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> >>             catenateAll="0"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>   </analyzer>
> >> </fieldtype>
> >>
> >> Here is the snippet of the file I generate.
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <add>
> >>   <doc>
> >>     <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> >>     <field name="title">E X I T</field>
> >>     <field name="authorName">uswautis (Hasanah Uswa)</field>
> >>     <field name="authorEmail"></field>
> >>     <field name="authorLinkMimeType"></field>
> >>     <field name="authorLink">http://twitter.com/uswautis</field>
> >>     <field name="lang">U</field>
> >>     <field name="publishDate">2011-03-27T13:21:52Z</field>
> >>     <field name="aquiDate">2011-03-27T13:22:13Z</field>
> >>     <field name="source"></field>
> >>     <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
> >>     <field name="feedContentMimeType">text/html</field>
> >>     <field name="feedContentEncoding"></field>
> >>     <field name="feedContent">null</field>
> >>     <field name="inboundLinks">0</field>
> >>     <field name="publisherType">MICROBLOG</field>
> >>     <field name="postTitle">E X I T</field>
> >>     <field name="postBodyMimeType">text/html</field>
> >>     <field name="postBodyEncoding">zlib</field>
> >>     <field name="postBody">mime_type: "text/html"
> >>     data: ""
> >>     </field>
> >>     <field name="tags">[]</field>
> >>   </doc>
> >>
> >>   <doc>
> >>     <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> >>     <field name="title">I want the sweater i saw in mango sooooo bad.</field>
> >>     <field name="authorName">imsuperangelica (angelica marie)</field>
> >>     <field name="authorEmail"></field>
> >>     <field name="authorLinkMimeType"></field>
> >>     <field name="authorLink">http://twitter.com/imsuperangelica</field>
> >>     <field name="lang">en</field>
> >>     <field name="publishDate">2011-03-27T13:21:52Z</field>
> >>     <field name="aquiDate">2011-03-27T13:22:13Z</field>
> >>     <field name="source"></field>
> >>     <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> >>     <field name="feedContentMimeType">text/html</field>
> >>     <field name="feedContentEncoding"></field>
> >>     <field name="feedContent">null</field>
> >>     <field name="inboundLinks">0</field>
> >>     <field name="publisherType">MICROBLOG</field>
> >>     <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
> >>     <field name="postBodyMimeType">text/html</field>
> >>     <field name="postBodyEncoding">zlib</field>
> >>     <field name="postBody">mime_type: "text/html"
> >>     data: ""
> >>     </field>
> >>     <field name="tags">[]</field>
> >>   </doc>
> >>
> >> </add>
> >>
> >> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> >>> Hi,
> >>>
> >>> I assume you are trying to post HTML files with post.jar and use
> >>> HTMLStripCharFilter to sanitize the HTML.
> >>>
> >>> But you refer to "my file" as if you have multiple docs in one file?
> >>> XML or HTML? Multiple files? To what UpdateRequestHandler are you
> >>> posting? /update/xml or /update/extract? For us to understand what
> >>> you're trying to achieve, please describe your project in more detail.
> >>>
> >>> To give some concrete feedback too: First off, your analyzer for "text"
> >>> is wrong. All charFilters need to come before the tokenizer. You also
> >>> lack an analyzer with type="query". If I were you I'd try the simplest
> >>> case first: get rid of mappingCharFilter, StopFilter, WordDelimFilter
> >>> and Stemmer, just do the most basic stuff you can, and go from there.
> >>>
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>>
> >>> On 28 March 2011, at 18:52, Charles Wardell wrote:
> >>>> Hi Everyone,
> >>>>
> >>>> I set up a server and began to index my data. I have two questions I am
> >>>> hoping someone can help me with. Many of my files seem to index
> >>>> without any problems. For others, I get a host of different errors.
> >>>> I am indexing primarily web based content and have identified my text
> >>>> field as follows:
> >>>>
> >>>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
> >>>>   <analyzer type="index">
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
> >>>>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>             generateWordParts="1" generateNumberParts="1"
> >>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>   </analyzer>
> >>>> </fieldtype>
> >>>>
> >>>> q1) Errors while indexing.
> >>>>
> >>>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
> >>>>   status="0"></result>' does not contain '<int name="status">0</int>'
> >>>>
> >>>> * SEVERE: Error processing "legacy" update
> >>>>   command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
> >>>>   character ' ' (code 32) in content after '<' (malformed start
> >>>>   element?). at [row,col {unknown-source}]: [1591,90] at
> >>>>   com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
> >>>>
> >>>> * Although I can't find the actual error, I recall solr giving me an
> >>>>   error when it came across a string &What - The error was something
> >>>>   like expecting semicolon after "What"
> >>>>
> >>>> q2) If my file has 1000 documents and I submit it with post.jar, if it
> >>>> comes across any of the above errors, will it break the processing of
> >>>> the whole file, or just the document with the error?
> >>>>
> >>>> Thanks in advance.
> >>>> Your help is very much appreciated.
> >>>>
> >>>> Charlie
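
To illustrate the advice in this thread (encode entities or wrap them in CDATA, and base64 encode binary data), here is a minimal sketch of a single document in an <add> file. The guid URL, the sample text, and the field name rawBody are made up for illustration. rawBody assumes a field of the binary type has been declared in schema.xml, which is not shown anywhere in the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="guid">http://example.com/statuses/1</field>
    <!-- A bare ampersand or less-than sign in text must be written as an entity. -->
    <field name="title">Q&amp;A: is 3 &lt; 4?</field>
    <!-- Alternatively, wrap raw markup in a CDATA section and leave it unescaped. -->
    <field name="postBody"><![CDATA[<p>Fish & chips</p>]]></field>
    <!-- Hypothetical binary field; SGVsbG8= is base64 for the ASCII bytes "Hello". -->
    <field name="rawBody">SGVsbG8=</field>
  </doc>
</add>
```

With the &What string from the original question, the text would be written as &amp;What or placed inside a CDATA section. Note that a CDATA section ends at the first ]]> sequence, so content that may contain that sequence still needs to be split or entity-encoded.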