Re: problems indexing web content

Markus Jelsma Mon, 28 Mar 2011 11:39:50 -0700

> I have about 1000 documents per xml file. I am not really doing anything
> with the data other than putting the xml tags around it. So essentially
> the data is okay with the exception of a few documents that are causing
> the errors.
> 
> Let's say document # 47 in the xml file has a problem, is the whole file
> skipped when using post.jar? I will add the CDATA to my xml generator.


I'm not sure, actually. I never tried it, but I think the whole file is thrown
away.
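
Either way, it may be easier to catch the bad document before posting. Here is
a rough sketch in Python (not something from your setup; the function name
find_xml_error is made up for illustration) that checks whether a generated
file is well-formed and reports where the parser gives up:

```python
import xml.etree.ElementTree as ET


def find_xml_error(xml_text):
    """Return None if xml_text is well-formed XML, else (line, column, message).

    Hypothetical helper: it only checks well-formedness, it does not know
    anything about the Solr schema or about what Solr itself would accept.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return line, column, str(err)
    return None


if __name__ == "__main__":
    sample = '<add>\n<doc><field name="title">Fish &Chips</field></doc>\n</add>'
    print(find_xml_error(sample))
```

The reported line should point you at the document that needs fixing, much
like the [row,col] in the Solr stack trace.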

> 
> Sometimes the data will come in as a string of pretty funky looking
> characters. I am assuming this is UTF-8. Is there any specialized data
> type I need to declare for this data?

Well, all data needs to be UTF-8 encoded. Wrongly encoded text is simply
indexed as is and won't throw an error, except for unescaped entities, of
course.
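
For the entities, a minimal sketch of what the generator could do, assuming it
is (or can call) Python. The helper names make_field and make_doc are
hypothetical; escape and quoteattr come from the standard library:

```python
from xml.sax.saxutils import escape, quoteattr


def make_field(name, value):
    """Build one <field> element, escaping &, < and > in the text content."""
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def make_doc(fields):
    """Build one <doc> element from a list of (name, value) pairs."""
    lines = ["<doc>"]
    lines.extend(make_field(name, value) for name, value in fields)
    lines.append("</doc>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_doc([("title", "Fish &What <b>now</b>")]))
```

Escaping also avoids the one catch with CDATA, which is that the text itself
must not contain the sequence ]]>.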

> 
> One other thing I noticed is that sometimes I may get data in binary
> compressed format. Like an image or something. Obviously I am not looking
> to index it, but is there a data type this can be stored as in Solr so I
> can retrieve and render easily?

Yes, use the binary field type [1]. You have to base64 encode the data.

[1]: http://lucene.apache.org/solr/api/org/apache/solr/schema/BinaryField.html
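
The base64 step itself is small. A sketch in Python, with hypothetical helper
names (encode_binary, decode_binary), assuming you have the raw bytes in hand
when you build the field and want them back when you render:

```python
import base64


def encode_binary(raw_bytes):
    """Return the base64 text to put inside the binary <field> element."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_binary(field_value):
    """Turn the stored base64 text back into the original bytes."""
    return base64.b64decode(field_value)


if __name__ == "__main__":
    payload = b"\x00\x01\xff"
    encoded = encode_binary(payload)
    print(encoded)
    print(decode_binary(encoded) == payload)
```

The base64 alphabet contains no &, < or >, so the encoded value needs no
further escaping in the XML.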

> 
> On Mar 28, 2011, at 1:38 PM, Markus Jelsma wrote:
> > Also, don't forget to encode entities or wrap them in CDATA.
> > 
> >> Jan,
> >> 
> >> thank you for such a quick reply. I have a feed coming in that I convert
> >> to an <add><doc></doc><doc></doc></add> file. Here is the type for text,
> >> including the index and query analyzers, with the changes suggested.
> >> 
> >> <fieldtype name="text" class="solr.TextField"
> >>            positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPorterFilterFactory"
> >>             protected="protwords.txt"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <filter class="solr.SynonymFilterFactory"
> >>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>             words="stopwords.txt"/>
> >>     <filter class="solr.WordDelimiterFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1"
> >>             catenateWords="0" catenateNumbers="0" catenateAll="0"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.EnglishPorterFilterFactory"
> >>             protected="protwords.txt"/>
> >>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>   </analyzer>
> >> </fieldtype>
> >> 
> >> Here is a snippet of the file I generate.
> >> 
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <add>
> >> <doc>
> >> <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> >> <field name="title">E X I T</field>
> >> <field name="authorName">uswautis (Hasanah Uswa)</field>
> >> <field name="authorEmail"></field>
> >> <field name="authorLinkMimeType"></field>
> >> <field name="authorLink">http://twitter.com/uswautis</field>
> >> <field name="lang">U</field>
> >> <field name="publishDate">2011-03-27T13:21:52Z</field>
> >> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> >> <field name="source"></field>
> >> <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
> >> <field name="feedContentMimeType">text/html</field>
> >> <field name="feedContentEncoding"></field>
> >> <field name="feedContent">null</field>
> >> <field name="inboundLinks">0</field>
> >> <field name="publisherType">MICROBLOG</field>
> >> <field name="postTitle">E X I T</field>
> >> <field name="postBodyMimeType">text/html</field>
> >> <field name="postBodyEncoding">zlib</field>
> >> <field name="postBody">mime_type: "text/html"
> >> data: ""
> >> </field>
> >> <field name="tags">[]</field>
> >> </doc>
> >> 
> >> <doc>
> >> <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> >> <field name="title">I want the sweater i saw in mango sooooo bad.</field>
> >> <field name="authorName">imsuperangelica (angelica marie)</field>
> >> <field name="authorEmail"></field>
> >> <field name="authorLinkMimeType"></field>
> >> <field name="authorLink">http://twitter.com/imsuperangelica</field>
> >> <field name="lang">en</field>
> >> <field name="publishDate">2011-03-27T13:21:52Z</field>
> >> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> >> <field name="source"></field>
> >> <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> >> <field name="feedContentMimeType">text/html</field>
> >> <field name="feedContentEncoding"></field>
> >> <field name="feedContent">null</field>
> >> <field name="inboundLinks">0</field>
> >> <field name="publisherType">MICROBLOG</field>
> >> <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
> >> <field name="postBodyMimeType">text/html</field>
> >> <field name="postBodyEncoding">zlib</field>
> >> <field name="postBody">mime_type: "text/html"
> >> data: ""
> >> </field>
> >> <field name="tags">[]</field>
> >> </doc>
> >> 
> >> </add>
> >> 
> >> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> >>> Hi,
> >>> 
> >>> I assume you try to post HTML files from post.jar, and use
> >>> HTMLStripCharFilter to sanitize the HTML.
> >>> 
> >>> But you refer to "my file" as if you have multiple docs in one file?
> >>> XML or HTML? Multiple files? To what UpdateRequestHandler are you
> >>> posting? /update/xml or /update/extract ? For us to understand what
> >>> you're trying to achieve, please describe your project in more detail.
> >>> 
> >>> 
> >>> To give some concrete feedback too: First off, your analyzer for "text"
> >>> is wrong. All charFilters need to be before the tokenizer. You also
> >>> lack an analyzer with type="query". If I were you I'd try the simplest
> >>> case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
> >>> and Stemmer - just do the most basic stuff you can and go from there.
> >>> 
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>> 
> >>> On 28. mars 2011, at 18.52, Charles Wardell wrote:
> >>>> Hi Everyone,
> >>>> 
> >>>> I set up a server and began to index my data. I have two questions I am
> >>>> hoping someone can help me with. Many of my files seem to index
> >>>> without any problems. Others, I get a host of different errors. I am
> >>>> indexing primarily web based content and have identified my text
> >>>> field as follows:
> >>>> 
> >>>> <fieldtype name="text" class="solr.TextField"
> >>>>            positionIncrementGap="100">
> >>>>   <analyzer type="index">
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>     <charFilter class="solr.MappingCharFilterFactory"
> >>>>                 mapping="mapping.txt"/>
> >>>>     <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>             words="stopwords.txt"/>
> >>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>             generateWordParts="1" generateNumberParts="1"
> >>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     <filter class="solr.EnglishPorterFilterFactory"
> >>>>             protected="protwords.txt"/>
> >>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>   </analyzer>
> >>>> </fieldtype>
> >>>> 
> >>>> q1) Errors while indexing.
> >>>> 
> >>>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
> >>>> status="0"></result>' does not contain '<int name="status">0</int>'
> >>>> 
> >>>> * SEVERE: Error processing "legacy" update command:
> >>>> com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character
> >>>> ' ' (code 32) in content after '<' (malformed start element?).
> >>>> at [row,col {unknown-source}]: [1591,90]
> >>>> at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
> >>>> 
> >>>> * Although I can't find the actual error, I recall Solr giving me an
> >>>> error when it came across the string "&What". The error was something
> >>>> like: expecting a semicolon after "What".
> >>>> 
> >>>> 
> >>>> q2) If my file has 1000 documents and I submit it with post.jar, if it
> >>>> comes across any of the above errors, will it break the processing of
> >>>> the whole file, or just the document with the error?
> >>>> 
> >>>> 
> >>>> Thanks in advance.
> >>>> Your help is very much appreciated.
> >>>> 
> >>>> Charlie
