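You're storing the language value in the session, aren't you? Well, there's
your problem. The crawler does not carry a session from one request to the
next, so it only ever gets the default (German) version of each page.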
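
To illustrate what I mean, here is a minimal sketch in Python. It is only a
model and assumes the site picks the language from a session cookie and that
the fetcher sends no cookies. The paths, the cookie name and the function
names are all made up for the example; none of them come from Nutch or from
your site.

```python
# Hypothetical model of a site that keeps the chosen language in the session.
# All names here are invented for illustration.

DEFAULT_LANG = "de"
PAGES = {"de": "Willkommen", "en": "Welcome"}


def serve(path, cookies):
    """Return (body, cookies_to_set) for one request."""
    if path.startswith("/setlang/"):
        # The language switch only changes session state; the URL of the
        # content page itself stays the same.
        lang = path.rsplit("/", 1)[-1]
        return "", {"lang": lang}
    lang = cookies.get("lang", DEFAULT_LANG)
    return PAGES[lang], {}


def fetch_stateless(paths):
    """A crawler that sends no cookies: every request starts a new session."""
    return [serve(p, {})[0] for p in paths]


def fetch_with_cookies(paths):
    """A browser-like client that keeps cookies between requests."""
    jar = {}
    bodies = []
    for p in paths:
        body, set_cookies = serve(p, jar)
        jar.update(set_cookies)
        bodies.append(body)
    return bodies


paths = ["/", "/setlang/en", "/"]
print(fetch_stateless(paths))     # ['Willkommen', '', 'Willkommen']
print(fetch_with_cookies(paths))  # ['Willkommen', '', 'Welcome']
```

The stateless client never sees the English page, because nothing in the URL
says which language is wanted. The usual way out is to give each language its
own URL. If you do that with a query parameter, check your filter rules too:
a rule whose character class contains = would skip such URLs.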

On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
> Hi List,
> 
> I'm new to Nutch and have been trying to index my own (very small)
> homepage, with success!
> 
> My homepage is available in German and English, but when I crawl it
> with Nutch, I only get the German content.
> 
> Here is the command line I used to crawl my site:
> 
> # bin/nutch crawl urls -dir crawl -depth 10
> 
> Here is my crawl-urlfilter.txt:
> 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> # accept hosts in MY.DOMAIN.NAME
> # Tachtler
> # default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://www.tachtler.net/
> 
> # skip everything else
> -.
> 
> Here is my nutch-default.xml (section: plugin.includes):
> 
> <property>
>    <name>plugin.includes</name>
> 
>    <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
> 
> Please, can anyone help me?
> 
> 
> Klaus.
> 
> 
> --
> 
> ------------------------------------------------
> e-Mail  : [email protected]
> Homepage: http://www.tachtler.net
> DokuWiki: http://www.dokuwiki.tachtler.net
> ------------------------------------------------

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
