Hi,

Check whether you have overridden the following properties in your
nutch-site.xml. If you have not, you may need to, since they determine how
much content is downloaded and parsed. Nutch truncates anything beyond the
limits set in these parameters:

<property>
        <name>file.content.limit</name>
        <value>1000000</value>
        <description>The length limit for downloaded content using the file://
        protocol, in bytes. If this value is nonnegative (>=0), content longer
        than it will be truncated; otherwise, no truncation at all. Do not
        confuse this setting with the http.content.limit setting.
        </description>
</property>

<property>
        <name>http.content.limit</name>
        <value>1000000</value>
        <description>The length limit for downloaded content using the http://
        protocol, in bytes. If this value is nonnegative (>=0), content longer
        than it will be truncated; otherwise, no truncation at all. Do not
        confuse this setting with the file.content.limit setting.
        </description>
</property>
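
If the pages you are crawling are larger than the limit, one option is to
turn truncation off. Below is a minimal sketch of such an override in
nutch-site.xml, relying only on the descriptions above (a negative value
means no truncation). The value -1 is my assumption for illustration; a
larger positive limit should work as well if you would rather keep a cap.

```xml
<?xml version="1.0"?>
<configuration>
        <!-- Assumption: -1 chosen to disable truncation for http:// fetches;
             any negative value should behave the same per the description. -->
        <property>
                <name>http.content.limit</name>
                <value>-1</value>
        </property>
</configuration>
```

You would most likely need to re-fetch and re-parse the affected pages
before re-indexing, since content that was already truncated stays that way.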

Thanks
Chethan


On Mon, Apr 22, 2013 at 4:46 PM, kneerosh <[email protected]> wrote:

> I noticed that when I crawl a website using Nutch and index it in Solr,
> searching for words in the content of the page returns no results,
> though I do get the title and text from the header.
> I then took the HTML source and validated it at http://xmlgrid.net/ and
> found that it's not well formed: only the head part with the title is
> recognized, and the body is probably not well-formed HTML.
>
> As I can't change the website's HTML, how do I get around this? How do I
> ensure Nutch collects text from the entire page, even if it's not well
> formed?
>
> Is there any other reason I could be missing content on a page, even
> though the page is accessed and indexed? Any suggestions?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-not-getting-all-content-of-page-tp4057870.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
