Re: problems indexing web content

Charles Wardell Mon, 28 Mar 2011 10:27:34 -0700

Jan,

Thank you for such a quick reply. I have a feed coming in that I convert to an 
XML update file of the form <add><doc></doc><doc></doc></add>.
Here is the field type for text, including index and query analyzers, with the suggested changes.



        <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <charfilter class="solr.HTMLStripCharFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
            <analyzer type="query">
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            </analyzer>
        </fieldtype>
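
For comparison, here is a minimal sketch of the simplest case along the lines Jan suggests in the quoted reply below: the char filter first, then the tokenizer, then a single lowercase filter, with a matching query analyzer. It is an untested illustration rather than a drop-in fix. It spells the elements `fieldType` and `charFilter` in camel case, on the assumption that Solr matches element names case-sensitively, and it leaves out synonyms, stop words, word delimiting and stemming until the basic case works.

```xml
<!-- Sketch only: a stripped-down "text" type to start from. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters operate on the raw text, so they are listed before the tokenizer. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Token filters follow the tokenizer. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Queries are not expected to contain HTML, so no char filter here. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Once documents index and search as expected with this, the other filters can be added back one at a time.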


Here is a snippet of the file I generate.

<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
<field name="title">E X I T</field>
<field name="authorName">uswautis (Hasanah Uswa)</field>
<field name="authorEmail"></field>
<field name="authorLinkMimeType"></field>
<field name="authorLink">http://twitter.com/uswautis</field>
<field name="lang">U</field>
<field name="publishDate">2011-03-27T13:21:52Z</field>
<field name="aquiDate">2011-03-27T13:22:13Z</field>
<field name="source"></field>
<field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
<field name="feedContentMimeType">text/html</field>
<field name="feedContentEncoding"></field>
<field name="feedContent">null</field>
<field name="inboundLinks">0</field>
<field name="publisherType">MICROBLOG</field>
<field name="postTitle">E X I T</field>
<field name="postBodyMimeType">text/html</field>
<field name="postBodyEncoding">zlib</field>
<field name="postBody">mime_type: "text/html"
data: ""
</field>
<field name="tags">[]</field>
</doc>

<doc>
<field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
<field name="title">I want the sweater i saw in mango sooooo bad.</field>
<field name="authorName">imsuperangelica (angelica marie)</field>
<field name="authorEmail"></field>
<field name="authorLinkMimeType"></field>
<field name="authorLink">http://twitter.com/imsuperangelica</field>
<field name="lang">en</field>
<field name="publishDate">2011-03-27T13:21:52Z</field>
<field name="aquiDate">2011-03-27T13:22:13Z</field>
<field name="source"></field>
<field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
<field name="feedContentMimeType">text/html</field>
<field name="feedContentEncoding"></field>
<field name="feedContent">null</field>
<field name="inboundLinks">0</field>
<field name="publisherType">MICROBLOG</field>
<field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
<field name="postBodyMimeType">text/html</field>
<field name="postBodyEncoding">zlib</field>
<field name="postBody">mime_type: "text/html"
data: ""
</field>
<field name="tags">[]</field>
</doc>

</add>
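
The parser errors quoted further down (an unexpected character after '<', and a semicolon expected after "&What") are the kind an XML parser typically reports when field text contains a bare `<` or `&`. Below is a sketch of how such content could be written so the update file stays well-formed. The guid, title and body values are hypothetical and only illustrate the two usual options: entity escaping and a CDATA section.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <!-- Hypothetical values, for illustration only. -->
    <field name="guid">http://example.com/statuses/1</field>
    <!-- Option 1: write & as &amp; and < as &lt; in plain text. -->
    <field name="title">Tom &amp; Jerry say 1 &lt; 2</field>
    <!-- Option 2: wrap raw HTML in CDATA so the parser does not interpret it. -->
    <field name="postBody"><![CDATA[<p>So &What now?</p>]]></field>
  </doc>
</add>
```

With either option the XML parser hands the original characters to Solr, and any HTML left in the field can then be stripped at index time by the HTML strip char filter.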

On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:

> Hi,
> 
> I assume you try to post HTML files from post.jar, and use 
> HTMLStripCharFilter to sanitize the HTML.
> 
> But you refer to "my file" as if you have multiple docs in one file? XML or 
> HTML? Multiple files?
> To what UpdateRequestHandler are you posting? /update/xml or /update/extract ?
> For us to understand what you're trying to achieve, please describe your 
> project in more detail.
> 
> 
> To give some concrete feedback too: First off, your analyzer for "text" is 
> wrong. All charFilters need to be before the tokenizer. You also lack an 
> analyzer with type="query". If I were you I'd try the simplest case first, 
> get rid of mappingCharFilter, StopFilter, WordDelimFilter and Stemmer - just 
> do the most basic stuff you can and go from there.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> On 28. mars 2011, at 18.52, Charles Wardell wrote:
> 
>> Hi Everyone,
>> 
>> I set up a server and began to index my data. I have two questions I am 
>> hoping someone can help me with. Many of my files seem to index without any 
>> problems. Others, I get a host of different errors. I am indexing primarily 
>> web based content and have identified my text field as follows:
>> 
>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>           <analyzer type="index">
>>               <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>               <charFilter class="solr.MappingCharFilterFactory" 
>> mapping="mapping.txt"/>
>>               <charfilter class="solr.HTMLStripCharFilterFactory"/>  
>>               <filter class="solr.StopFilterFactory" ignoreCase="true" 
>> words="stopwords.txt"/>
>>               <filter class="solr.WordDelimiterFilterFactory" 
>> generateWordParts="1" generateNumberParts="1" catenateWords="1" 
>> catenateNumbers="1" catenateAll="0"/>
>>               <filter class="solr.LowerCaseFilterFactory"/>
>>               <filter class="solr.EnglishPorterFilterFactory" 
>> protected="protwords.txt"/>
>>               <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>           </analyzer>
>>       </fieldtype>
>> 
>> 
>> q1) Errors while indexing.
>> 
>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result 
>> status="0"></result>' does not contain '<int name="status">0</int>'
>> 
>> * SEVERE: Error processing "legacy" update 
>> command:com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' 
>> ' (code 32) in content after '<' (malformed start element?). at [row,col 
>> {unknown-source}]: [1591,90] at 
>> com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
>> 
>> * Although I can't find the actual error, I recall Solr giving me an error 
>> when it came across the string &What. The error was something like "expecting 
>> semicolon after What".
>> 
>> 
>> q2) If my file has 1000 documents and I submit it with post.jar, if it comes 
>> across any of the above errors, will it break the processing of the whole 
>> file, or just the document with the error?
>> 
>> 
>> Thanks in advance. 
>> Your help is very much appreciated.
>> 
>> Charlie
>> 
>
