Re: problems indexing web content

Markus Jelsma Mon, 28 Mar 2011 10:39:39 -0700

Also, don't forget to encode entities or wrap them in CDATA.
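
For example, a bare "&What" in a field value has to reach Solr as "&amp;What", or sit inside a CDATA section. Below is a minimal sketch, assuming the feed is converted to <add><doc> XML with a Python script. The helper names field_xml and field_cdata are hypothetical and are not part of Solr or post.jar; only the standard-library calls are real.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def field_xml(name, value):
    """Return a <field> element with the text content entity-encoded."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def field_cdata(name, value):
    """Return a <field> element with the text wrapped in a CDATA section."""
    # A literal "]]>" would close the section early, so split it
    # across two adjacent CDATA sections.
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    return "<field name=%s><![CDATA[%s]]></field>" % (quoteattr(name), safe)


if __name__ == "__main__":
    raw = "<p>Fish &What</p>"
    for element in (field_xml("postBody", raw), field_cdata("postBody", raw)):
        print(element)
        # Both forms parse cleanly and yield the original text again.
        assert ET.fromstring(element).text == raw
```

Either form gives the XML parser well-formed input. The parser hands the original text back to the field, so any HTML markup in it is still there for HTMLStripCharFilterFactory to strip at analysis time.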


> Jan,
> 
> thank you for such a quick reply. I have a feed coming in that I convert to
> an <add><doc></doc><doc></doc></add> file. Here is the field type for text,
> including index and query analyzers, with the changes suggested.
> 
> 
>         <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>             <analyzer type="index">
>                 <charfilter class="solr.HTMLStripCharFilterFactory"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>             <analyzer type="query">
>                 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                 <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>                 <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>             </analyzer>
>         </fieldtype>
> 
> 
> Here is a snippet of the file I generate.
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
> <doc>
> <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="title">E X I T</field>
> <field name="authorName">uswautis (Hasanah Uswa)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/uswautis</field>
> <field name="lang">U</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">E X I T</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> <doc>
> <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="title">I want the sweater i saw in mango sooooo bad.</field>
> <field name="authorName">imsuperangelica (angelica marie)</field>
> <field name="authorEmail"></field>
> <field name="authorLinkMimeType"></field>
> <field name="authorLink">http://twitter.com/imsuperangelica</field>
> <field name="lang">en</field>
> <field name="publishDate">2011-03-27T13:21:52Z</field>
> <field name="aquiDate">2011-03-27T13:22:13Z</field>
> <field name="source"></field>
> <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
> <field name="feedContentMimeType">text/html</field>
> <field name="feedContentEncoding"></field>
> <field name="feedContent">null</field>
> <field name="inboundLinks">0</field>
> <field name="publisherType">MICROBLOG</field>
> <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
> <field name="postBodyMimeType">text/html</field>
> <field name="postBodyEncoding">zlib</field>
> <field name="postBody">mime_type: "text/html"
> data: ""
> </field>
> <field name="tags">[]</field>
> </doc>
> 
> </add>
> 
> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
> > Hi,
> > 
> > I assume you try to post HTML files from post.jar, and use
> > HTMLStripCharFilter to sanitize the HTML.
> > 
> > But you refer to "my file" as if you have multiple docs in one file? XML
> > or HTML? Multiple files? To what UpdateRequestHandler are you posting?
> > /update/xml or /update/extract ? For us to understand what you're trying
> > to achieve, please describe your project in more detail.
> > 
> > 
> > To give some concrete feedback too: First off, your analyzer for "text"
> > is wrong. All charFilters need to be before the tokenizer. You also
> > lack an analyzer with type="query". If I were you I'd try the simplest
> > case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
> > and Stemmer - just do the most basic stuff you can and go from there.
> > 
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > 
> > On 28. mars 2011, at 18.52, Charles Wardell wrote:
> >> Hi Everyone,
> >> 
> >> I set up a server and began to index my data. I have two questions I am
> >> hoping someone can help me with. Many of my files seem to index without
> >> any problems. Others, I get a host of different errors. I am indexing
> >> primarily web based content and have identified my text field as
> >> follows:
> >> 
> >> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
> >>     <analyzer type="index">
> >>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
> >>         <charfilter class="solr.HTMLStripCharFilterFactory"/>
> >>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
> >>         <filter class="solr.LowerCaseFilterFactory"/>
> >>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
> >>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>     </analyzer>
> >> </fieldtype>
> >> 
> >> q1) Errors while indexing.
> >> 
> >> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
> >> status="0"></result>' does not contain '<int name="status">0</int>'
> >> 
> >> * SEVERE: Error processing "legacy" update
> >> command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected
> >> character ' ' (code 32) in content after '<' (malformed start
> >> element?). at [row,col {unknown-source}]: [1591,90] at
> >> com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
> >> 
> >> * Although I can't find the actual error, I recall Solr giving me an
> >> error when it came across the string "&What". The error was something
> >> like: expecting a semicolon after "What".
> >> 
> >> 
> >> q2) If my file has 1000 documents and I submit it with post.jar, and it
> >> hits any of the above errors, will that abort processing of the whole
> >> file, or only the document with the error?
> >> 
> >> 
> >> Thanks in advance.
> >> Your help is very much appreciated.
> >> 
> >> Charlie
