Re: How to use nutch 2.2.1 to crawl images

Baizhang Ma Thu, 03 Dec 2015 06:27:13 -0800

Hi Madhav, Thanks for your reply! I have changed nutch-site.xml but
nothing in nutch-default.xml. Here is my nutch-site.xml:


<configuration>
    <property>
        <name>http.agent.name</name>
        <value>MyNutchSpider</value>
    </property>
    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
        <description>The character encoding to fall back to when no other
        information is available</description>
    </property>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.sql.store.SqlStore</value>
        <description>Default class for storing data</description>
    </property>
    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
        <description>Value of the Accept-Language request header field.
        This allows selecting a non-English language as the default one to
        retrieve. It is a useful setting for search engines built for a
        certain national group.
        </description>
    </property>
    <property>
        <name>generate.batch.id</name>
        <value>*</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|protocol-httpclient</value>
        <description>Regular expression naming plugin directory names to
        include. Any plugin not matching this expression is excluded.
        In any case you need to at least include the nutch-extensionpoints
        plugin. By default Nutch includes crawling just HTML and plain text
        via HTTP, and basic indexing and search plugins. In order to use
        HTTPS please enable protocol-httpclient, but be aware of possible
        intermittent problems with the underlying commons-httpclient library.
        </description>
    </property>
</configuration>
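
For reference, here is a minimal Python sketch for checking which plugin directory names the plugin.includes expression above would select. It assumes the whole directory name has to match the expression (an assumption, not something confirmed in this thread), and the helper name is_included is made up for illustration.

```python
import re

# Value of plugin.includes, copied from the configuration above.
PLUGIN_INCLUDES = (
    r"protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|urlnormalizer-(pass|regex|basic)|scoring-opic|protocol-httpclient"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the pattern.

    Assumption: the name must match in full, not just as a substring.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # Print one line per plugin name with the result of the check.
    for name in ("parse-tika", "parse-html", "protocol-http", "index-more"):
        print(name, is_included(name))
```

Under that assumption, parse-tika and parse-html are selected by this expression, while protocol-http and index-more are not.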

If you have any ideas about this, I would very much appreciate it if you
shared them with me! Thanks again, Madhav.

Best regards,
Byzen.Ma

2015-12-03 13:55 GMT+08:00 Madhav Sharan <[email protected]>:

> Hi Byzen,
>
> I understand you have commented out the image suffixes in the regex
> filter. Can you share your nutch-site.xml and your regex filter, and also
> say whether you have changed anything in nutch-default.xml?
>
> --
> Thanks
> Madhav Sharan
>
>
> On Mon, Nov 30, 2015 at 10:15 PM, Baizhang Ma <[email protected]>
> wrote:
>
> > Hi, everyone.
> > I'm a new Nutch user and I want to crawl images from web pages. I have
> > removed image suffixes such as gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG|bmp|BMP
> > from the exclusion rule in regex-urlfilter.txt, but it does not work. My
> > Nutch version is 2.2.1. Could anyone kindly tell me how to do it? If I
> > need to use a plugin, could you tell me which plugin I need and how to
> > configure it, as I am quite inexperienced with this. Thank you very much.
> >
> > Best regards,
> > Byzen. Ma
> >
>
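
The regex-urlfilter.txt change discussed above can be checked outside of Nutch with a small Python sketch. It assumes the rule semantics commonly described for that file: each line is `+` or `-` followed by a regular expression, the first rule whose pattern is found anywhere in the URL decides, and a URL that matches no rule is rejected. The two rule sets and the example.com URLs are shortened, hypothetical examples, not the actual file from this thread.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first matching rule (assumed semantics)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is rejected


# Hypothetical rule set that still skips image suffixes.
IMAGES_SKIPPED = parse_rules(r"""
-\.(gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG|bmp|BMP|css|CSS|zip|ZIP)$
+.
""")

# Hypothetical rule set with the image suffixes taken out of the skip rule.
IMAGES_ALLOWED = parse_rules(r"""
-\.(css|CSS|zip|ZIP)$
+.
""")

if __name__ == "__main__":
    url = "http://example.com/pics/cat.jpg"
    print(accepts(IMAGES_SKIPPED, url), accepts(IMAGES_ALLOWED, url))
```

A check like this only shows whether an image URL gets past the URL filter. Whether the image is then fetched, parsed, and stored depends on the rest of the configuration.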
