Re: problems indexing web content

Charles Wardell Mon, 28 Mar 2011 11:01:24 -0700

I have about 1000 documents per xml file. I am not really doing anything with 
the data other than putting the xml tags around it.
So essentially the data is okay with the exception of a few documents that are 
causing the errors.


Let's say document # 47 in the xml file has a problem, is the whole file 
skipped when using post.jar?
I will add the CDATA to my xml generator.
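
For the generator change, here is a minimal sketch of the two options Markus mentions (entity-encoding versus CDATA). Python is used only for illustration, and the helper names `field_escaped` and `field_cdata` are made up for this example, not part of Solr or post.jar.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def field_escaped(name, value):
    """Emit a <field> with &, < and > in the value replaced by entities."""
    return "<field name=%s>%s</field>" % (quoteattr(name), escape(value))


def field_cdata(name, value):
    """Emit a <field> whose value is wrapped in a CDATA section."""
    # "]]>" would end the section early, so split it across two sections.
    safe = value.replace("]]>", "]]]]><![CDATA[>")
    return "<field name=%s><![CDATA[%s]]></field>" % (quoteattr(name), safe)


# A value like the "&What" case: a bare ampersand plus markup.
text = "Tom &What <b>bold</b>"
for make in (field_escaped, field_cdata):
    # Both forms parse back to the original string.
    assert ET.fromstring(make("title", text)).text == text
```

Either form should keep a stray `&` or `<` in the feed from being read as markup; which one to use is mostly a matter of taste.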

Sometimes the data will come in as a string of pretty funky looking characters. 
I am assuming this is UTF-8. Is there any specialized data type I need to 
declare for this data?
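
On the encoding question, one thing that can be checked on the generator side is that the text really is valid UTF-8 and contains no characters that XML 1.0 forbids (most control characters). The sketch below is an assumption about what the "funky" data might contain, not a diagnosis; `clean_text` is a hypothetical helper name.

```python
import re

# Characters outside the XML 1.0 "Char" production (mostly control characters).
_INVALID_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(raw, encoding="utf-8"):
    """Decode bytes if needed, then drop characters XML 1.0 does not allow."""
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD instead of raising an exception.
        raw = raw.decode(encoding, errors="replace")
    return _INVALID_XML.sub("", raw)


# When writing the file, state the encoding explicitly so it matches the
# encoding="UTF-8" declaration, e.g. open(path, "w", encoding="utf-8").
```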

One other thing I noticed is that sometimes I may get data in binary compressed 
format. Like an image or something. Obviously I am not looking to index it, but 
is there a data type this can be stored as in Solr so I can retrieve and render 
easily?
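
For the binary case, one hedged possibility is to base64-encode the bytes before putting them in the XML. As far as I know, Solr's `solr.BinaryField` type expects base64 text in XML updates, but check that your Solr version provides it. The helper names below are hypothetical.

```python
import base64
import zlib


def encode_binary(payload: bytes) -> str:
    """Turn raw bytes into ASCII text that is safe inside an XML field."""
    return base64.b64encode(payload).decode("ascii")


def decode_binary(stored: str) -> bytes:
    """Reverse encode_binary after retrieving the stored value."""
    return base64.b64decode(stored)


# Round trip with a zlib-compressed body, as in the postBodyEncoding field.
compressed = zlib.compress(b"<p>hello</p>")
stored = encode_binary(compressed)
assert zlib.decompress(decode_binary(stored)) == b"<p>hello</p>"
```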


On Mar 28, 2011, at 1:38 PM, Markus Jelsma wrote:

> Also, don't forget to encode entities or wrap them in CDATA.
> 
>> Jan,
>> 
>> thank you for such a quick reply. I have a feed coming in that I convert to
>> an <add><doc></doc><doc></doc> Here is the type for text including index
>> and query with the changes suggested.
>> 
>> 
>>        <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>            <analyzer type="index">
>>                <charfilter class="solr.HTMLStripCharFilterFactory"/>
>>                <filter class="solr.LowerCaseFilterFactory"/>
>>                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>            </analyzer>
>>            <analyzer type="query">
>>                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>>                <filter class="solr.LowerCaseFilterFactory"/>
>>                <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>            </analyzer>
>>        </fieldtype>
>> 
>> 
>> Here is a snippet of the file I generate.
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <add>
>> <doc>
>> <field name="guid">http://twitter.com/uswautis/statuses/51997364122165249</field>
>> <field name="title">E X I T</field>
>> <field name="authorName">uswautis (Hasanah Uswa)</field>
>> <field name="authorEmail"></field>
>> <field name="authorLinkMimeType"></field>
>> <field name="authorLink">http://twitter.com/uswautis</field>
>> <field name="lang">U</field>
>> <field name="publishDate">2011-03-27T13:21:52Z</field>
>> <field name="aquiDate">2011-03-27T13:22:13Z</field>
>> <field name="source"></field>
>> <field name="feedURL">http://twitter.com/uswautis/statuses/51997364122165249</field>
>> <field name="feedContentMimeType">text/html</field>
>> <field name="feedContentEncoding"></field>
>> <field name="feedContent">null</field>
>> <field name="inboundLinks">0</field>
>> <field name="publisherType">MICROBLOG</field>
>> <field name="postTitle">E X I T</field>
>> <field name="postBodyMimeType">text/html</field>
>> <field name="postBodyEncoding">zlib</field>
>> <field name="postBody">mime_type: "text/html"
>> data: ""
>> </field>
>> <field name="tags">[]</field>
>> </doc>
>> 
>> <doc>
>> <field name="guid">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>> <field name="title">I want the sweater i saw in mango sooooo bad.</field>
>> <field name="authorName">imsuperangelica (angelica marie)</field>
>> <field name="authorEmail"></field>
>> <field name="authorLinkMimeType"></field>
>> <field name="authorLink">http://twitter.com/imsuperangelica</field>
>> <field name="lang">en</field>
>> <field name="publishDate">2011-03-27T13:21:52Z</field>
>> <field name="aquiDate">2011-03-27T13:22:13Z</field>
>> <field name="source"></field>
>> <field name="feedURL">http://twitter.com/imsuperangelica/statuses/51997364050862080</field>
>> <field name="feedContentMimeType">text/html</field>
>> <field name="feedContentEncoding"></field>
>> <field name="feedContent">null</field>
>> <field name="inboundLinks">0</field>
>> <field name="publisherType">MICROBLOG</field>
>> <field name="postTitle">I want the sweater i saw in mango sooooo bad.</field>
>> <field name="postBodyMimeType">text/html</field>
>> <field name="postBodyEncoding">zlib</field>
>> <field name="postBody">mime_type: "text/html"
>> data: ""
>> </field>
>> <field name="tags">[]</field>
>> </doc>
>> 
>> </add>
>> 
>> On Mar 28, 2011, at 1:02 PM, Jan Høydahl wrote:
>>> Hi,
>>> 
>>> I assume you try to post HTML files from post.jar, and use
>>> HTMLStripCharFilter to sanitize the HTML.
>>> 
>>> But you refer to "my file" as if you have multiple docs in one file? XML
>>> or HTML? Multiple files? To what UpdateRequestHandler are you posting?
>>> /update/xml or /update/extract ? For us to understand what you're trying
>>> to achieve, please describe your project in more detail.
>>> 
>>> 
>>> To give some concrete feedback too: First off, your analyzer for "text"
>>> is wrong. All charFilters need to be before the tokenizer. You also
>>> lack an analyzer with type="query". If I were you I'd try the simplest
>>> case first, get rid of mappingCharFilter, StopFilter, WordDelimFilter
>>> and Stemmer - just do the most basic stuff you can and go from there.
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
>>> On 28. mars 2011, at 18.52, Charles Wardell wrote:
>>>> Hi Everyone,
>>>> 
>>>> I set up a server and began to index my data. I have two questions I am
>>>> hoping someone can help me with. Many of my files seem to index without
>>>> any problems. Others, I get a host of different errors. I am indexing
>>>> primarily web based content and have identified my text field as
>>>> follows:
>>>> 
>>>> <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
>>>> 
>>>>          <analyzer type="index">
>>>> 
>>>>              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>              <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>>>>              <charfilter class="solr.HTMLStripCharFilterFactory"/>
>>>>              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>              <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
>>>>              <filter class="solr.LowerCaseFilterFactory"/>
>>>>              <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>>>              <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>> 
>>>>          </analyzer>
>>>> 
>>>>      </fieldtype>
>>>> 
>>>> q1) Errors while indexing.
>>>> 
>>>> * SimplePostTool: WARNING: Unexpected response from Solr: '<result
>>>> status="0"></result>' does not contain '<int name="status">0</int>'
>>>> 
>>>> * SEVERE: Error processing "legacy" update command:
>>>>   com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' ' (code 32) in content after '<' (malformed start element?).
>>>>   at [row,col {unknown-source}]: [1591,90]
>>>>   at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:648)
>>>> 
>>>> * Although I can't find the actual error, I recall solr giving me an
>>>> error when it came across a string &What - The error was something like
>>>> expecting semicolon after "What"
>>>> 
>>>> 
>>>> q2) If my file has 1000 documents and I submit it with post.jar, if it
>>>> comes across any of the above errors, will it break the processing of
>>>> the whole file, or just the document with the error?
>>>> 
>>>> 
>>>> Thanks in advance.
>>>> Your help is very much appreciated.
>>>> 
>>>> Charlie
