Re: Little Help for nutch Newbe...

Markus Jelsma Thu, 02 Dec 2010 04:23:34 -0800

Yes, but it isn't a URL Nutch can discover. Selecting the language this way is
very difficult for crawlers and similar tools. Language information should
depend on the URL. You serve different content (en and de) on the same URL,
and the URL is the unique identifier for documents in Nutch.
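
To illustrate why the URL matters, here is a minimal Python sketch of a
URL-keyed index. It is a toy model, not Nutch code: the function name
index_pages, the sample texts, and the /de/ and /en/ paths are all
hypothetical. In practice a crawler that holds no session would only ever be
served the default language; the sketch shows that even if both variants were
fetched, they would collapse into one entry.

```python
# Toy model of an index keyed by URL; an illustration only, not Nutch code.

def index_pages(fetched):
    """Store each page under its URL; a later page with the same URL
    replaces the earlier one."""
    index = {}
    for url, lang, text in fetched:
        index[url] = {"lang": lang, "text": text}
    return index

# Session-based language switch: both variants share one URL.
session_based = [
    ("http://www.tachtler.net/", "de", "Willkommen"),
    ("http://www.tachtler.net/", "en", "Welcome"),
]

# URL-based language switch, using hypothetical /de/ and /en/ paths.
url_based = [
    ("http://www.tachtler.net/de/", "de", "Willkommen"),
    ("http://www.tachtler.net/en/", "en", "Welcome"),
]

session_index = index_pages(session_based)
url_index = index_pages(url_based)

print(len(session_index))  # one entry: the variants overwrite each other
print(len(url_index))      # two entries: one per language
```

With distinct URLs per language, each variant gets its own key and both can
be indexed side by side.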


I don't see a work-around here but maybe someone else does...

On Thursday 02 December 2010 13:13:46 Klaus Tachtler wrote:
> Hi,
> 
> Yes, but I also have the following language information in the HTML markup:
> 
> German version:
> 
> <html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">
> <head>
> ...
> 
> English version:
> 
> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
> <head>
> ...
> 
> Thank you for your answer...
> 
> Klaus.
> 
> > You're storing the language value in your session, aren't you? Well, there
> > is your problem.
> > 
> > On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
> >> Hi List,
> >> 
> >> I'm new to Nutch and am trying to index my own (very small) homepage,
> >> with success so far!
> >> 
> >> My homepage is available in German and English, but when I crawl it
> >> with Nutch, I only get the German content.
> >> 
> >> Here is the command line I used to crawl my site:
> >> 
> >> # bin/nutch crawl urls -dir crawl -depth 10
> >> 
> >> Here is my crawl-urlfilter.txt:
> >> 
> >> # skip file:, ftp:, & mailto: urls
> >> -^(file|ftp|mailto):
> >> 
> >> # skip image and other suffixes we can't yet parse
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >> 
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[...@=]
> >> 
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> 
> >> # accept hosts in MY.DOMAIN.NAME
> >> # Tachtler
> >> # default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> +^http://www.tachtler.net/
> >> 
> >> # skip everything else
> >> -.
> >> 
> >> Here is my nutch-default.xml (section: plugin.includes):
> >> 
> >> <property>
> >>    <name>plugin.includes</name>
> >>    <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
> >> 
> >> Please, can anyone help me?
> >> 
> >> 
> >> Klaus.
> >> 
> >> 
> >> --
> >> 
> >> ------------------------------------------------
> >> e-Mail  : [email protected]
> >> Homepage: http://www.tachtler.net
> >> DokuWiki: http://www.dokuwiki.tachtler.net
> >> ------------------------------------------------
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> 
> ----- End of message from [email protected] -----
> 
> 
> 
> Regards
> Klaus.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
