Hi,

Yes, but each version also declares its language in the <html> element:

German version:

<html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">
<head>
...

English version:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
...
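For anyone who wants to check what a fetched page actually declares, here is a minimal sketch using only the Python standard library. It is not part of Nutch; the names LangExtractor and declared_language are made up for illustration.

```python
from html.parser import HTMLParser


class LangExtractor(HTMLParser):
    """Records the lang and xml:lang attributes of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.xml_lang = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        # Only the first <html> start tag is of interest.
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            attr_map = dict(attrs)
            self.lang = attr_map.get("lang")
            self.xml_lang = attr_map.get("xml:lang")


def declared_language(markup):
    """Return (lang, xml:lang) as declared on the <html> element."""
    parser = LangExtractor()
    parser.feed(markup)
    return parser.lang, parser.xml_lang


page = '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head></head></html>'
print(declared_language(page))
```

Running this against the markup of both fetched versions shows which language each response really announces.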

Thank you for your answer...

Klaus.


You're storing the language value in your session, aren't you? Well, there is
your problem.
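A toy model of why that matters, assuming the site keeps the chosen language only in a cookie-backed session and the crawler never sends that cookie back. All function names and the German default are assumptions made for illustration, not code from the site or from Nutch.

```python
DEFAULT_LANGUAGE = "de"  # assumption: the site falls back to German


def render_page(session):
    """Hypothetical server handler: the language comes from the session only."""
    language = session.get("lang", DEFAULT_LANGUAGE)
    return f'<html lang="{language}">'


def switch_language(session, language):
    """Hypothetical language switch: the choice is stored in the session."""
    session["lang"] = language


def crawl_without_session(page_count):
    """Every fetch starts with an empty session, like a cookieless crawler."""
    return [render_page({}) for _ in range(page_count)]


# A browser keeps its session, so the switch sticks.
browser_session = {}
switch_language(browser_session, "en")
print(render_page(browser_session))

# The crawler starts from scratch on every request.
print(crawl_without_session(3))
```

Under these assumptions the English content is never reachable for the crawler, because both versions live at the same URLs and differ only by session state. Giving each language its own URL would avoid that.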

On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
Hi List,

I'm new to Nutch and am trying to index my own (very small) homepage, with
success!

My homepage is available in German and English, but when I crawl it with
Nutch, I only get the German content.

Here is the command line I used to crawl my site:

# bin/nutch crawl urls -dir crawl -depth 10

Here is my crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[...@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# Tachtler
# default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://www.tachtler.net/

# skip everything else
-.
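How such a filter file is applied can be sketched as follows, assuming the usual regex URL filter behaviour: rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches nothing is rejected. This is a Python approximation, not Nutch's Java code. The character-class rule is left out because it is garbled in this archive copy, and the example URLs are invented.

```python
import re

# (accept?, pattern) pairs in file order; "+" lines accept, "-" lines reject.
RULES = [
    (False, r"^(file|ftp|mailto):"),
    (False, r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|"
            r"xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    (False, r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    (True, r"^http://www.tachtler.net/"),
    (False, r"."),
]

COMPILED = [(accept, re.compile(pattern)) for accept, pattern in RULES]


def accepts(url):
    """Return the verdict of the first rule whose pattern occurs in the URL."""
    for accept, pattern in COMPILED:
        if pattern.search(url):
            return accept
    return False  # nothing matched: treat the URL as rejected


for url in (
    "http://www.tachtler.net/index.html",
    "http://www.tachtler.net/logo.png",
    "http://example.org/",
):
    print(url, accepts(url))
```

With these rules every page under http://www.tachtler.net/ passes, so the filter file alone would not explain a missing language version.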

Here is my nutch-default.xml (section: plugin.includes):

<property>
   <name>plugin.includes</name>

   <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
</property>
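To see which plugins this value switches on, here is a small check, assuming the value is a regular expression that must match a plugin id in full. The helper name is_enabled and the sample ids analysis-fr and protocol-httpclient are only illustrative.

```python
import re

# The plugin.includes value from above, joined back into one expression.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(text|html|js|tika)|"
    r"index-(basic|anchor|more)|query-(basic|site|url|lang)|"
    r"response-(json|xml)|summary-basic|scoring-opic|"
    r"urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier"
)

INCLUDES = re.compile(PLUGIN_INCLUDES)


def is_enabled(plugin_id):
    """True if the whole plugin id matches the include expression."""
    return INCLUDES.fullmatch(plugin_id) is not None


for plugin_id in ("language-identifier", "analysis-de", "analysis-en",
                  "analysis-fr", "protocol-httpclient"):
    print(plugin_id, is_enabled(plugin_id))
```

Under that assumption the language identifier and both analysis plugins are active, so the plugin list does not look like the reason for the missing English content.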

Please, can anyone help me?


Klaus.


--

------------------------------------------------
e-Mail  : [email protected]
Homepage: http://www.tachtler.net
DokuWiki: http://www.dokuwiki.tachtler.net
------------------------------------------------

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350




----- End of message from [email protected] -----



Regards,
Klaus.

