Hi

2010-05-06 Thread Zehra Göçer

I have a problem with Nutch. My project is link analysis. I crawled 
www.mersin.edu.tr, analysed the linkdb, and saw all the links within 
mersin.edu.tr. But I also need to find links to other sites, for example 
www.tubitak.gov.tr, and I cannot find them. How can I find these links? 
Please help.


Re: Hi

2010-05-06 Thread Harry Nutch
Did you check crawl-urlfilter.txt?
All the domain names that you'd like to crawl have to be mentioned.
e.g.

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/

Also check property db.ignore.external.links in nutch-default.xml. Should be
set to false.
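A sketch of the corresponding override in conf/nutch-site.xml (this uses Nutch's standard property-override mechanism; the value shown is what Harry suggests):

```xml
<!-- nutch-site.xml: follow outlinks that leave the current domain,
     e.g. from mersin.edu.tr to tubitak.gov.tr -->
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>
```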



Re: Hi, and help with inject scoring...

2010-03-24 Thread Toby Cole

Excellent, I'll have a look at the patch.
Thanks, T





Hi, and help with inject scoring...

2010-03-23 Thread Toby Cole

Hi Nutch list,
We're using nutch for what basically amounts to an intranet crawl (just 
a few domains). We have a HUGE inject list as the site contains a lot of 
Ajax pages.


What I'm wondering is… is there a simple way of getting the injected 
URLs to have a higher default score than URLs discovered during the normal 
crawl? I've tried upping the default score, but that also modifies the 
score URLs get when they're added from the crawl.


Many Thanks,
Toby.


Re: Hi, and help with inject scoring...

2010-03-23 Thread Julien Nioche
Hi Toby,

Have a look at https://issues.apache.org/jira/browse/NUTCH-655
The patch has been committed to the SVN repository and should allow you to
do exactly what you described.

HTH

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
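As a sketch, the per-URL metadata NUTCH-655 adds to the inject step could give injected URLs a higher starting score. The nutch.score key is taken from the issue, the example URLs are placeholders, and the exact syntax may differ by version:

```text
# inject seed file sketch: boost injected URLs above the crawl default
http://intranet.example.com/ajax-page-1  nutch.score=10.0
http://intranet.example.com/ajax-page-2  nutch.score=10.0
```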




Re: hi Kubes:the question about develop environment!

2009-04-23 Thread askNutch

hi kubes:
Thank you for your answers! I'm sorry that I didn't express my question clearly.
I run Nutch on only one machine, and I can't debug Hadoop in Nutch because
Hadoop is present only as a jar in lib. How can I debug the Hadoop source in Nutch?

And to my surprise, the tutorial RunNutchInEclipse1.0 doesn't start and
configure Hadoop (master listen port etc.). When I debug Nutch at a breakpoint,
it displays: there is no source file attached to the class file
URLClassPath.class. Why?

Can Hadoop run in a VMware machine?

I also met other problems; they are in another message,   run nutch on
eclipse problem? '

Thanks!!!


-- 
View this message in context: 
http://www.nabble.com/hi-%3Athe-question-about-develop-environment%21-tp23170026p23191120.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: hi Kubes:the question about develop environment!

2009-04-23 Thread Dennis Kubes



askNutch wrote:

hi kubes:
Thank you for your answers! I'm sorry that I didn't express my question clearly.
I run Nutch on only one machine, and I can't debug Hadoop in Nutch because
Hadoop is present only as a jar in lib. How can I debug the Hadoop source in Nutch?


Build Hadoop from scratch, or run it inside of Eclipse as a project.  You 
will have to start each Hadoop server manually through an Eclipse 
launcher, and you will need the Hadoop project source on the debugger's 
source path.




And to my surprise, the tutorial RunNutchInEclipse1.0 doesn't start and
configure Hadoop (master listen port etc.). When I debug Nutch at a breakpoint,
it displays: there is no source file attached to the class file
URLClassPath.class. Why?


When running it through Eclipse you will also need to remove the Hadoop 
jar in lib from Nutch (or at least from the Eclipse classpath) and add 
the Hadoop project instead.  This way Eclipse pulls from the Hadoop 
source code and will display the source file.




can hadoop run in vmware machine?


Probably, yes.  Many people run it under Xen; I don't know if there is that 
much difference.  I don't see why there would be a problem as long as 
it can get socket access.


Dennis



and i also met other problers ,it is in another message   run nutch on
eclipse problem? '

thanks !!!





Re: hi Kubes:the question about develop environment!

2009-04-23 Thread Susam Pal
On Thu, Apr 23, 2009 at 12:09 PM, askNutch hehehah...@126.com wrote:


 can hadoop run in vmware machine?


I am running a Hadoop cluster where each node is a VMware virtual
machine.  So, yes, it is possible. As long as you are able to connect
to sockets from one virtual machine to another, I don't see why you
can not run Hadoop in VMware virtual machines.

Regards,
Susam Pal


Re: hi Kubes:the question about develop environment!

2009-04-22 Thread Alexander Aristov
Why not send such mails personally, if you're addressing a single person?

Or do you want to hear other opinions?

Best Regards
Alexander Aristov






Re: hi Kubes:the question about develop environment!

2009-04-22 Thread Dennis Kubes



askNutch wrote:
hi Kubes:
You are the expert!

Can you tell me what development environment you use to develop Nutch?


Linux, Ubuntu (usually the most recent), sun jdk, core2 laptop (although 
hoping to upgrade to a sagernotebook.com quad core soon :) ), Eclipse 
stable (3.4 I think).

such as IDE etc.

I want to debug nutch.


Debugging MapReduce, and hence Nutch, jobs is difficult.  The main reason 
is that Hadoop/Nutch spins up a new JVM for each Map and Reduce 
task, so it is difficult to connect to that JVM as it is created and 
launched automagically.  Here are some options depending on what you are 
trying to debug:


1) Run all Hadoop server processes (namenode, etc.) through Eclipse 
using the internal debugger.  This isn't always the best way; it's usually 
only used when debugging some part of the Hadoop infrastructure such as 
socket communication.


2) Run most of the Hadoop servers in separate processes, and run the 
tasktracker inside of Eclipse with the internal debugger.  This is 
mainly used when debugging a specific MapRunner, MapTask, or ReduceTask 
interacting with Hadoop.  You won't be able to debug the Map or Reduce 
task itself, just the communication with the Hadoop server, for instance 
status reporting.


3) Debugging the Map/Reduce task itself: logging.  Judicious logging is 
most often what I use.  Also use a very small example if you can, to 
give yourself short turnaround times.  Unless your problem occurs 
only on a large dataset, don't debug on a large dataset.


Hope this helps.

Dennis

   
thank you !!!  
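A fourth option, not from Dennis's list: attach a remote debugger to the spawned task JVMs. A sketch assuming Hadoop's classic mapred.child.java.opts property (debug one task at a time, since every child JVM would block on the same port):

```xml
<!-- hadoop-site.xml sketch: each task JVM waits for a debugger on port 5005 -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005</value>
</property>
```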


Re: hi Kubes:the question about develop environment!

2009-04-22 Thread Dennis Kubes



Alexander Aristov wrote:

Why not to post such mails personally if you address to single person?

Want to know other opinions?


I would :)

Dennis









Re: hi Kubes:the question about develop environment!

2009-04-22 Thread Alexander Aristov
My environment is Windows Vista, MyEclipse 7.0, Sun JDK 6. This is enough in
most cases, when I'm not dealing with Hadoop debugging. For Hadoop I run a
virtual Fedora Linux and again use Eclipse.
But I usually improve/develop plugins, so Vista is enough.

Best Regards
Alexander Aristov





hi Kubes:the question about develop environment!

2009-04-21 Thread askNutch

hi Kubes: 
You are the expert!

Can you tell me what development environment you use to develop Nutch?

Such as the IDE, etc.

I want to debug Nutch.

Thank you!!!
-- 
View this message in context: 
http://www.nabble.com/hi-Kubes%3Athe-question-about-develop-environment%21-tp23170026p23170026.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Hi What is the use of refine-query-init.jsp,refine-query.jsp

2007-03-13 Thread Enis Soztutar

These two JSP files are part of the ontology extension point. Basically, 
plugins extending this extension point (currently the ontology plugin) 
implement two functions, getSynonyms() and getSubclasses(). The 
ontology plugin thus provides synonyms (from WordNet) and subclasses, from the 
defined ontologies, for search query refinement.

You should enable the ontology plugin and add some ontology URL to the 
configuration; you can check the ontology plugin's readme file.
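A hedged sketch of what enabling the plugin might look like in nutch-site.xml. The plugin id "ontology" and the rest of the plugin.includes value are assumptions, not confirmed; check the plugin's readme for the real names:

```xml
<!-- nutch-site.xml sketch: append the ontology plugin to the enabled set.
     Plugin ids in this list are assumed, not confirmed. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|ontology</value>
</property>
```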





Hi What is the use of refine-query-init.jsp,refine-query.jsp

2007-03-12 Thread inalasuresh

Hi,

I uncommented refine-query.jsp and refine-query-init.jsp in search.jsp.
I searched for the keyword bike and it gave results.
Before that I tried running the application both with and without the
comments, but it gave the same result.
So please, can anyone suggest: what is the use of
refine-query-init.jsp & refine-query.jsp, and what is the end result of
uncommenting these JSPs?
-- 
View this message in context: 
http://www.nabble.com/Hi-What-is-the-use-of-refine-query-init.jsp%2Crefine-query.jsp-tf3389500.html#a9434697
Sent from the Nutch - User mailing list archive at Nabble.com.






Hi what is the use of subcollections.xml

2007-03-12 Thread inalasuresh

Hi,
Can anyone help me? I am new to Nutch.

What is the use of subcollections.xml, and when is it called?
Please respond to my query.

Thanks & regards,
suresh..
-- 
View this message in context: 
http://www.nabble.com/Hi-what-is-the-use-of-subcollections.xml-tf3389528.html#a9434780
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Hi what is the use of subcollections.xml

2007-03-12 Thread Enis Soztutar


Hi,

Subcollections is a plugin for indexing the URLs matching a regular 
expression, and subcollections.xml is the configuration file it uses:

<subcollection>
  <name>nutch</name>
  <id>nutch</id>
  <whitelist>http://lucene.apache.org/nutch/</whitelist>
  <blacklist />
</subcollection>

When this plugin is enabled, Nutch adds a field named subcollection, 
with the value nutch, to the index entry for the URL 
http://lucene.apache.org/nutch/. Refer to the plugin's readme file.


Hi...How to set Nutch-0.8.1 to save logs into log files when running the crawl job?

2006-12-21 Thread kevin

Hi,
How do I set Nutch 0.8.1 to save logs into log files when running the crawl
job? Is this set in nutch-site.xml, or in another configuration file?

Thanks for your help in advance!


--
kevin


Re: Hi...How to set Nutch-0.8.1 to save logs into log files when running the crawl job?

2006-12-21 Thread Sean Dean
You can play around with these two by setting them to true in your 
nutch-site.xml file. Hadoop logs just about everything to logs/hadoop.log. The 
file rolls over each day automatically, with .year-month-day appended to the old file.

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
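Nutch 0.8.1 logs through log4j, and the daily rollover Sean describes matches log4j's DailyRollingFileAppender. A sketch of conf/log4j.properties (the appender name and layout pattern here are illustrative, not taken from the thread):

```properties
# log4j.properties sketch: roll logs/hadoop.log daily as .yyyy-MM-dd
log4j.rootLogger=INFO,DRFA
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=logs/hadoop.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```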




hi all

2006-11-03 Thread kauu

hi,
I have a problem. I want to crawl only the pages whose URLs contain
...item_detail, but I must start the crawl from www..com,
and if I set rules in crawl-urlfilter.txt I can't get the pages
I want at all.

So what do I need to do now?
Should I do something with regex-urlfilter.txt, or something else?
--
www.babatu.com


hi all

2006-04-02 Thread kauu
hi all:
I've hit a big problem when crawling FTP: it seems that Nutch can't parse
or index files with Chinese names. The command looks like:

bin/nutch crawl urls.txt -dir test.dir

(i've modified the crawl-urlfilter.txt)


# skip file:, ftp:,  mailto: urls
#-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^ftp://*


When I search for something in Tomcat 5.0.28, the results are messy characters.
Can anyone tell me anything helpful to solve this big problem?
Any reply will be appreciated.

--
www.babatu.com


Re: hi all

2006-04-02 Thread kauu
Thanks for the advice!
Now I know what's up.
But my OS is WinXP (Chinese), which supports Chinese very well, and when I used
LUKE to view the index there are messy characters from crawling the Chinese
sites.
So, how can I deal with it?

any reply will be appreciated.

On 4/2/06, Dan Morrill [EMAIL PROTECTED] wrote:

 Good Morning Kauu,

 I have noticed that Nutch only knows about UTF-8 character codes, so the
 simplified Chinese character set is UTF-8 and should come out ok. If the
 crawl sees Chinese in a non-utf-8, the web site may be serving them under
 an
 older ISO standard, or you may not have the language pack installed to
 properly support Chinese.

 Personally, I would download the language pack for your Operating system
 and
 see what happens.

 r/d




--
www.babatu.com


RE: hi all

2006-04-02 Thread Dan Morrill
Kauu,

Are you using the simplified Chinese localization package for
Windows XP, or the non-simplified UTF version? You might need an
IME from here:
http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx 

That may help out. 

Since you are using Luke to see the index, Luke may not have character
support built in for non-UTF-8 character sets (meaning gork when you look at
it). I went to the Luke site http://www.getopt.org/luke/ to see if they
mention the character sets they support, but nothing states which sets
are supported. 

When you run your search, do you see good characters, or do you see gork?
Luke may not be able to understand the ISO character sets. (Hypothesis). 

r/d




Re: hi all

2006-04-02 Thread Andrzej Bialecki

Dan Morrill wrote:

Since you are using Luke to see the index, luke may not have the character
support built in for non utf-8 character sets (meaning gork when you look at
it). I went to the luke site http://www.getopt.org/luke/ to see if they make
mention of the character sets they support, but there is nothing that states
they support any character set. 


When you run your search, do you see good characters, or do you see gork?
Luke may not be able to understand the ISO character sets. (Hypothesis). 
  


Hi,

(I'm the guy behind Luke)

Luke uses UTF-8, because that's what Lucene stores in the index. You may 
experience problems with the default font it uses, i.e. it may not 
support all Unicode characters. Please try changing the font 
(in Settings) and see if that helps.


Another frequent source of garbled characters is reading the 
original content using the wrong encoding, e.g. reading a UTF-8 file 
using your native platform encoding like Latin1 or Big5, or the other 
way around. The broken characters then get encoded to UTF-8 when 
Lucene writes out the index, and are restored from UTF-8 to their broken 
form when Luke reads the index.
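Andrzej's second point can be reproduced in a few lines; a sketch in Python (not from the thread, just an illustration of the wrong-decode round trip):

```python
# Illustration: bytes decoded with the wrong charset, then re-encoded,
# stay garbled through an otherwise faithful UTF-8 round trip.
original = "中文"  # two Chinese characters

# A UTF-8 file read with a Latin-1 ("native platform") decoder:
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))  # mojibake, not the original characters

# Lucene then stores the garbled string faithfully as UTF-8 ...
stored = garbled.encode("utf-8")
# ... and Luke reads it back, still garbled:
assert stored.decode("utf-8") == garbled
assert garbled != original

# The damage is reversible only if you know which wrong charset was used:
assert garbled.encode("latin-1").decode("utf-8") == original
```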


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: hi all

2006-04-02 Thread Dan Morrill
Andrzej,

Cheers! Good to know. Thanks!
r/d
