Re: Little Help for nutch Newbe...

Markus Jelsma Thu, 02 Dec 2010 04:41:32 -0800


On Thursday 02 December 2010 13:30:14 Klaus Tachtler wrote:
> Hi,
> 
> is there a possibility to use the following circumstance (just an idea):
> 
> When you enter my site http://www.tachtler.net with a browser whose
> default language is (en), you see the English content, and with a
> browser whose default language is (de), you see the German
> content.
> 
> Can I do something like this with nutch, crawling the same page twice,
> with two different "default-languages" ?


Yes, you could send a configured HTTP header and retrieve English or German
content, but both versions share the same URL, so the second fetch would
overwrite the language version stored by the first.
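
To illustrate the overwrite problem, here is a minimal Python sketch, not
Nutch code. It assumes the site picks its language from the Accept-Language
request header (the thread only says "a configured HTTP header"), and the
names PAGES, serve and crawl are hypothetical stand-ins for the site and for
a store keyed by URL:

```python
# Hypothetical stand-in for a site that chooses content by Accept-Language.
PAGES = {"de": "Willkommen", "en": "Welcome"}


def serve(url, accept_language):
    """Return the body for the first supported language listed in the header.

    Simplification: q-values are ignored; entries are tried in listed order.
    """
    for part in accept_language.split(","):
        lang = part.split(";")[0].strip().lower()[:2]
        if lang in PAGES:
            return PAGES[lang]
    return PAGES["de"]  # assumed site default


def crawl(url, accept_language, index):
    # The URL is the only key, so a second fetch replaces the first.
    index[url] = serve(url, accept_language)


index = {}
crawl("http://www.tachtler.net/", "de", index)
crawl("http://www.tachtler.net/", "en", index)
print(len(index), index["http://www.tachtler.net/"])  # prints: 1 Welcome
```

Only one entry survives because the key is the URL alone. Keeping both
versions would need distinct URLs per language.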
> 
> Thank you for your time...
> 
> Klaus.
> 
> > Yes, but it isn't a URL Nutch can discover. Handling language in this
> > manner is very difficult for crawlers etc. Language information should
> > depend on the URL.
> > You have different content (en and de) on the same URL, and the URL is
> > the unique identifier of a document in Nutch.
> > 
> > I don't see a work-around here but maybe someone else does...
> > 
> > On Thursday 02 December 2010 13:13:46 Klaus Tachtler wrote:
> >> Hi,
> >> 
> >> Yes, but I have the following HTTP header information too:
> >> 
> >> German version:
> >> 
> >> <html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">
> >> <head>
> >> ...
> >> 
> >> English version:
> >> 
> >> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
> >> <head>
> >> ...
> >> 
> >> Thank you for your answer...
> >> 
> >> Klaus.
> >> 
> >> > You're storing the language value in your session, aren't you? Well,
> >> > there is your problem.
> >> > 
> >> > On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
> >> >> Hi List,
> >> >> 
> >> >> I'm new to Nutch and am trying to index my own (very small) homepage,
> >> >> with success!
> >> >> 
> >> >> My homepage is reachable in German and English, but when I try to
> >> >> crawl it with Nutch, I only get the German content.
> >> >> 
> >> >> Here is the command line I used to crawl my site:
> >> >> 
> >> >> # bin/nutch crawl urls -dir crawl -depth 10
> >> >> 
> >> >> Here is my crawl-urlfilter.txt:
> >> >> 
> >> >> # skip file:, ftp:, & mailto: urls
> >> >> -^(file|ftp|mailto):
> >> >> 
> >> >> # skip image and other suffixes we can't yet parse
> >> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >> >> 
> >> >> # skip URLs containing certain characters as probable queries, etc.
> >> >> -[...@=]
> >> >> 
> >> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> >> 
> >> >> # accept hosts in MY.DOMAIN.NAME
> >> >> # Tachtler
> >> >> # default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> >> +^http://www.tachtler.net/
> >> >> 
> >> >> # skip everything else
> >> >> -.
> >> >> 
> >> >> Here is my nutch-default.xml (section: plugin.includes):
> >> >> 
> >> >> <property>
> >> >> 
> >> >>    <name>plugin.includes</name>
> >> >> 
> >> >> <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
> >> >> 
> >> >> Please, can anyone help me?
> >> >> 
> >> >> 
> >> >> Klaus.
> >> >> 
> >> >> 
> >> >> --
> >> >> 
> >> >> ------------------------------------------------
> >> >> e-Mail  : [email protected]
> >> >> Homepage: http://www.tachtler.net
> >> >> DokuWiki: http://www.dokuwiki.tachtler.net
> >> >> ------------------------------------------------
> >> > 
> >> > --
> >> > Markus Jelsma - CTO - Openindex
> >> > http://www.linkedin.com/in/markus17
> >> > 050-8536620 / 06-50258350
> >> 
> >> ----- End of message from [email protected] -----
> >> 
> >> 
> >> 
> >> Regards
> >> Klaus.
> >> 
> >> --
> >> 
> >> ------------------------------------------------
> >> e-Mail  : [email protected]
> >> Homepage: http://www.tachtler.net
> >> DokuWiki: http://www.dokuwiki.tachtler.net
> >> ------------------------------------------------
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> 
> ----- End of message from [email protected] -----
> 
> 
> 
> Regards
> Klaus.
> 
> --
> 
> ------------------------------------------------
> e-Mail  : [email protected]
> Homepage: http://www.tachtler.net
> DokuWiki: http://www.dokuwiki.tachtler.net
> ------------------------------------------------

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
