Re: Little Help for nutch Newbe...

Klaus Tachtler Thu, 02 Dec 2010 04:30:44 -0800

Hi,

is it possible to make use of the following behaviour (just an idea):

When you open my site http://www.tachtler.net in a browser whose default language is English (en), you see the English content; with a browser whose default language is German (de), you see the German content.

Can I do something like this with Nutch, crawling the same page twice with two different "default languages"?


Thank you for your time...

Klaus.
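
To make the behaviour described above concrete, here is a minimal sketch of how a server might choose between the two versions from the browser's Accept-Language header. It is not the actual code behind www.tachtler.net: the function name pick_language, the available languages tuple and the German fallback are all assumptions for illustration.

```python
def pick_language(accept_language, available=("de", "en"), default="de"):
    """Pick a content language from an Accept-Language header value.

    Hypothetical helper: `available` and `default` are assumptions,
    not taken from the real site configuration.
    """
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        candidates.append((-quality, position, tag))

    # Highest quality first; ties keep the order given in the header.
    for negative_quality, _, tag in sorted(candidates):
        if negative_quality == 0:
            continue  # q=0 marks a language as not acceptable
        primary = tag.split("-")[0]  # "de-DE" -> "de"
        if primary in available:
            return primary
    return default


print(pick_language("en"))
print(pick_language("de-DE,de;q=0.9,en;q=0.5"))
```

The point of the sketch is that the URL is the same in both calls; only a request header differs, and a crawler that always sends the same header always gets the same version.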

Yes, but it isn't a URL Nutch can discover. Using language in this manner is very difficult for crawlers etc. Language information should depend on the URL.

You have different content (en and de) on the same URL, and the URL is the unique identifier of a document in Nutch.

I don't see a workaround here, but maybe someone else does...
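
A toy illustration of the point above, assuming nothing about Nutch's real storage beyond "the URL is the key". The dictionary, the store function and the /de/ and /en/ paths are hypothetical; the site may not offer such paths.

```python
def store(index, url, content):
    """Toy document store keyed by URL (not Nutch's actual data structures)."""
    index[url] = content


same_url = {}
store(same_url, "http://www.tachtler.net/", "deutscher Inhalt")
store(same_url, "http://www.tachtler.net/", "english content")
# Only one document survives: the second fetch replaced the first.
print(len(same_url), same_url["http://www.tachtler.net/"])

# If the language were part of the URL (hypothetical paths),
# both versions could be stored side by side.
per_language = {}
store(per_language, "http://www.tachtler.net/de/", "deutscher Inhalt")
store(per_language, "http://www.tachtler.net/en/", "english content")
print(len(per_language))
```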

On Thursday 02 December 2010 13:13:46 Klaus Tachtler wrote:

Hi,

Yes, but I also have the following language information at the top of the HTML:

German version:

<html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">
<head>
...

English version:

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
...

Thank you for your answer...

Klaus.
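
Note that the lang attribute is part of the HTML document, so a client only sees it after the page has been fetched; it reports which version was served and does not influence which one the server chooses. A small sketch, using only Python's standard library, of reading that attribute from a fetched page (the class and function names are made up for this example):

```python
from html.parser import HTMLParser


class LangAttributeParser(HTMLParser):
    """Remember the lang (or xml:lang) attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            attributes = dict(attrs)
            self.lang = attributes.get("lang") or attributes.get("xml:lang")


def declared_language(html_text):
    """Return the language declared on <html>, or None if there is none."""
    parser = LangAttributeParser()
    parser.feed(html_text)
    return parser.lang


page = ('<html xmlns="http://www.w3.org/1999/xhtml" lang="de" xml:lang="de">'
        '<head></head><body></body></html>')
print(declared_language(page))
```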

> You're storing the language value in your session, aren't you? Well, there
> is your problem.
>
> On Thursday 02 December 2010 12:56:31 Klaus Tachtler wrote:
>> Hi List,
>>
>> I'm new to Nutch and am trying to index my own (very small) homepage, with
>> success!
>>
>> My homepage is available in German and English, but when I crawl
>> it with Nutch, I only get the German content.
>>
>> Here is the command line I used to crawl my site:
>>
>> # bin/nutch crawl urls -dir crawl -depth 10
>>
>> Here is my crawl-urlfilter.txt:
>>
>> # skip file:, ftp:, & mailto: urls
>> -^(file|ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> -[...@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept hosts in MY.DOMAIN.NAME
>> # Tachtler
>> # default: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^http://www.tachtler.net/
>>
>> # skip everything else
>> -.
>>
>> Here is my nutch-default.xml (section: plugin.includes):
>>
>> <property>
>>    <name>plugin.includes</name>
>>    <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor|more)|query-(basic|site|url|lang)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|analysis-(de|en)|language-identifier</value>
>> </property>
>>
>> Please, can anyone help me?
>>
>>
>> Klaus.
>>
>>
>> --
>>
>> ------------------------------------------------
>> e-Mail  : [email protected]
>> Homepage: http://www.tachtler.net
>> DokuWiki: http://www.dokuwiki.tachtler.net
>> ------------------------------------------------
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
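
For readers unfamiliar with the filter file quoted above, here is a rough sketch of how such +/- rules can be read: rules are tried from top to bottom and the first pattern found in the URL decides whether it is accepted. This is an approximation in Python, not Nutch's Java implementation; the rule with the character class is left out because it is garbled in the archived message.

```python
import re

# (sign, pattern) pairs copied from the quoted crawl-urlfilter.txt.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"^http://www.tachtler.net/"),
    ("-", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern occurs in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all: reject


print(accepts("http://www.tachtler.net/"))
print(accepts("http://www.tachtler.net/logo.png"))
```

Nothing in these rules refers to a language, so the filter treats the German and English versions identically: they share one URL and it is either accepted or rejected as a whole.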

----- End of message from [email protected] -----



Regards,
Klaus.

