Hi Madhav, Thanks for your reply! I have changed nutch-site.xml but
nothing in nutch-default.xml. Here is my nutch-site.xml:
<configuration>
<property>
<name>http.agent.name</name>
<value>MyNutchSpider</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the Accept-Language request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|protocol-httpclient</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin.
By default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please
enable protocol-httpclient, but be aware of possible intermittent
problems with the underlying commons-httpclient library.
</description>
</property>
</configuration>
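A quick way to check which plugin names the plugin.includes value above selects is a small script like the one below. It is only a sketch: the candidate names are a hypothetical sample rather than a listing of my plugins directory, and it assumes Nutch applies the expression as a full match against each plugin directory name.

```python
import re

# plugin.includes value copied from the nutch-site.xml above
PLUGIN_INCLUDES = (
    r"protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|urlnormalizer-(pass|regex|basic)|scoring-opic|protocol-httpclient"
)

# Hypothetical sample of plugin directory names, for illustration only
CANDIDATES = [
    "protocol-selenium",
    "protocol-httpclient",
    "protocol-http",
    "urlfilter-regex",
    "parse-html",
    "parse-tika",
    "parse-js",
    "index-basic",
    "index-more",
    "scoring-opic",
]


def included_plugins(pattern, names):
    """Return the names matched in full by the pattern (assumed semantics)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    selected = included_plugins(PLUGIN_INCLUDES, CANDIDATES)
    for name in CANDIDATES:
        status = "included" if name in selected else "excluded"
        print(f"{name:22} {status}")
```

Under that full-match assumption, a name such as protocol-http is not selected by the protocol-httpclient alternative, so only the names spelled out in the expression are picked up.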
If you have any ideas about this, I would very much appreciate it if you could share them with me!
Thanks again, Madhav.
Best regards,
Byzen.Ma
2015-12-03 13:55 GMT+08:00 Madhav Sharan <[email protected]>:
> Hi Byzen,
>
> I understand you have commented out the image suffixes in the regex
> filter. Can you share your nutch-site.xml and regex filter, and also
> anything you have changed in nutch-default.xml?
>
> --
> Thanks
> Madhav Sharan
>
>
> On Mon, Nov 30, 2015 at 10:15 PM, Baizhang Ma <[email protected]>
> wrote:
>
> > Hi, everyone.
> > I'm a new Nutch user and I want to crawl images from web pages. I
> > have excluded image suffixes like
> > gif|GIF|jpg|JPG|png|PNG|jpeg|JPEG|bmp|BMP
> > in regex-urlfilter.txt, but it does not work. My Nutch version is
> > 2.2.1. Could anyone kindly tell me how to do this? If I need to use a
> > plugin, could you tell me which plugin I need and how to configure it, as
> > I am quite inexperienced with this. Thank you very much.
> >
> > Best regards,
> > Byzen. Ma
> >
>