On Thursday 02 December 2010 13:30:14 Klaus Tachtler wrote:
> Hi,
>
> is there a possibility to use the following circumstance (just an idea):
>
> When you enter my site http://www.tachtler.net with a browser and a
> default language in the browser --> (en) then you see the English
> content, and with a browser with default language in the browser -->
> (de), then you can see the German content.
>
> Can I do something like this with nutch, crawling the same page twice,
> with two different "default-languages"?

Yes, you could send a configured HTTP header and retrieve English or German
content, but it'll be the same URL, so each fetch overwrites the previously
stored language version.

>
> Thank you for your time...
>
> Klaus.
>
> > Yes, but it isn't a URL Nutch can discover. Using language in this
> > manner is very difficult for crawlers etc. Language information should
> > depend on the URL.
> > You have different content (en and de) on the same URL and... the URL is
> > a unique identifier in the Nutch documents.
> >
> > I don't see a work-around here but maybe someone else does...
> >
> > On Thursday 02 December 2010 13:13:46 Klaus Tachtler wrote:
> >> Hi,
> >>
> >> Yes, but I have the following HTTP-Header Information too:
> >>
> >> german-Version:
> >>
> >> <html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">
> >> <head>
> >> ...
> >>
> >> english-Version:
> >>
> >> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
> >> <head>
> >> ...
> >>
> >> Thank you for your answer...
> >>
> >> Klaus.
> >>
> >> > You're storing the language value in your session, aren't you? Well,
> >> > there is your problem.
> >> >
> >> > On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
> >> >> Hi List,
> >> >>
> >> >> I'm new to nutch, and try to index my own (very small) homepage, with
> >> >> success!
> >> >>
> >> >> My homepage is reachable in German and English, but when I try to
> >> >> crawl it with nutch, I only get the German content?
> >> >>
> >> >> Here the command-line I used to crawl my site:
> >> >>
> >> >> # bin/nutch crawl urls -dir crawl -depth 10
> >> >>
> >> >> Here my crawl-urlfilter.txt
> >> >>
> >> >> # skip file:, ftp:, & mailto: urls
> >> >> -^(file|ftp|mailto):
> >> >>
> >> >> # skip image and other suffixes we can't yet parse
> >> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >> >>
> >> >> # skip URLs containing certain characters as probable queries, etc.
> >> >> -[...@=]
> >> >>
> >> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> >>
> >> >> # accept hosts in MY.DOMAIN.NAME
> >> >> # Tachtler
> >> >> # default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> >> +^http://www.tachtler.net/
> >> >>
> >> >> # skip everything else
> >> >> -.
> >> >>
> >> >> Here my nutch-default.xml (section: plugin.includes)
> >> >>
> >> >> <property>
> >> >> <name>plugin.includes</name>
> >> >> <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
> >> >>
> >> >> Please, can anyone help me?
> >> >>
> >> >>
> >> >> Klaus.
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >> ------------------------------------------------
> >> >> e-Mail : [email protected]
> >> >> Homepage: http://www.tachtler.net
> >> >> DokuWiki: http://www.dokuwiki.tachtler.net
> >> >> ------------------------------------------------
> >> >
> >> > --
> >> > Markus Jelsma - CTO - Openindex
> >> > http://www.linkedin.com/in/markus17
> >> > 050-8536620 / 06-50258350
> >>
> >> ----- End of message from [email protected] -----
> >>
> >>
> >>
> >> Regards
> >> Klaus.
> >>
> >> --
> >>
> >> ------------------------------------------------
> >> e-Mail : [email protected]
> >> Homepage: http://www.tachtler.net
> >> DokuWiki: http://www.dokuwiki.tachtler.net
> >> ------------------------------------------------
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
>
> ----- End of message from [email protected] -----
>
>
>
> Regards
> Klaus.
>
> --
>
> ------------------------------------------------
> e-Mail : [email protected]
> Homepage: http://www.tachtler.net
> DokuWiki: http://www.dokuwiki.tachtler.net
> ------------------------------------------------

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
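
P.S. To illustrate the point in the reply about a configured HTTP header and the same URL being overwritten, here is a minimal Python sketch. It is not Nutch code. The names `build_request`, `crawl` and `fake_fetch` are hypothetical, and `fake_fetch` only pretends to be a server that picks content from the Accept-Language header, so nothing is sent over the network. The store is keyed by URL alone, which is the assumption that mirrors how the URL acts as the unique identifier of a document.

```python
from urllib.request import Request

URL = "http://www.tachtler.net/"


def build_request(url, language):
    """Build a request that asks the server for one preferred language."""
    return Request(url, headers={"Accept-Language": language})


def crawl(url, languages, fetch):
    """Fetch `url` once per language into a store keyed by URL only."""
    store = {}
    for language in languages:
        # Same key every time, so a later fetch replaces the earlier one.
        store[url] = fetch(build_request(url, language))
    return store


def fake_fetch(request):
    """Hypothetical stand-in for a server doing language negotiation."""
    # urllib stores the header name capitalized as "Accept-language".
    language = request.get_header("Accept-language", "en")
    if language.startswith("de"):
        return "Deutscher Inhalt"
    return "English content"


if __name__ == "__main__":
    store = crawl(URL, ["en", "de"], fake_fetch)
    # Only one entry survives, holding whichever language was fetched last.
    print(len(store), store[URL])
```

Whatever order the languages are fetched in, only one entry remains for the URL, which is why the language would need to be part of the URL for both versions to be kept. As far as I know, Nutch reads the value of this header from an `http.accept.language` property, but please check the nutch-default.xml of your version before relying on that.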

