Re: Please, unsubscribe me

2009-10-29 Thread David Jashi
You are doomed to read about Nutch to the very ends of your existence, people.


2009/10/29 Le Manh Cuong cuong...@gmail.com:
 Sorry, but the last time I tried to unsubscribe, it didn't work.
 And now it doesn't work either. :)

 -Original Message-
 From: SunGod [mailto:sun...@cheemer.org]
 Sent: Thursday, October 29, 2009 10:09 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Please, unsubscribe me

 List-Help: mailto:nutch-user-h...@lucene.apache.org
 List-Unsubscribe: mailto:nutch-user-unsubscr...@lucene.apache.org
 List-Post: mailto:nutch-user@lucene.apache.org
 List-Id: nutch-user.lucene.apache.org

 2009/10/29 Le Manh Cuong cuong...@gmail.com

 Me too. Could you please help to remove me (cuong09m @gmail.com) from the
 nutch and hadoop mailing lists?

 -Original Message-
 From: caoyuzhong [mailto:caoyuzh...@hotmail.com]
 Sent: Thursday, October 29, 2009 9:49 AM
 To: nutch-user@lucene.apache.org
  Subject: RE: Please, unsubscribe me


 The unsubscription message does not work for me either.
 Could you please help to remove me (caoyuzh...@hotmail.com) from the nutch
 and hadoop mailing lists?

  Subject: Please, unsubscribe  me
  From: nsa...@officinedigitali.it
  To: nutch-user@lucene.apache.org
  Date: Wed, 28 Oct 2009 16:43:05 +0100
 
  Hi,
  the unsubscription message doesn't work. Please, remove me from the
  list.
 
  Thanks.
 
 

 _
 全新 Windows 7:寻找最适合您的 PC。了解详情。
 http://www.microsoft.com/china/windows/buy/






Re: Authenticity of URLs from DMOZ

2009-10-06 Thread David Jashi
Gaurang,

About those AVG alerts - you are fetching web pages together with any
viruses they may be infected with.
Of course, antivirus software will scream about it.

I wouldn't run any such software on a crawling machine.
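
For the automatic filtering asked about below: infected HTML pages can't
really be filtered by URL, but the obvious binary downloads can be skipped
with a rule in conf/crawl-urlfilter.txt (or regex-urlfilter.txt). A minimal
sketch - the extension list is only an example, not a recommendation:
---
# skip URLs pointing at executables and archives (the files AV tends to flag)
-\.(exe|EXE|msi|MSI|scr|SCR|zip|ZIP|rar|RAR)$
---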

პატივისცემით,
დავით ჯაში




On Tue, Oct 6, 2009 at 12:36, Gaurang Patel gaurangtpa...@gmail.com wrote:
 Hey,

 Can anyone tell me what could be the reason for the following, which happened
 while I was fetching data using bin/nutch fetch:

 My AVG antivirus is detecting virus threats while Nutch fetches pages from
 the available urls of *crawldb*. I injected DMOZ Open Directory urls into crawldb.
 The antivirus detected 4 threats within only half an hour of the start of
 fetching.

 Is there any other way (any source other than DMOZ) to get a list of whole-web
 urls? Or is there an automatic way to avoid such harmful urls being
 fetched? Let me know asap.


 Regards,
 Gaurang



Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin.
 You must add it to the nutch-site.xml file, in the
 <name>plugin.includes</name> section.

Shame on me! Thanks a lot.


 It's something like this:



 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
 </property>


 From: da...@jashi.ge
 Date: Tue, 29 Sep 2009 18:59:52 +0400
 Subject: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org

 Hello, all.

 I've got a bit of trouble with Nutch 1.0 and multilanguage support:

 I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
 analysis-de (German) and analysis-ge (Georgian).
 Here are the innards of my seed file:
 ---
 http://212.72.133.54/l/test.html
 http://212.72.133.54/l/de.html
 ---
 The first is Georgian, the other German. When I run

 bin/nutch crawl seed -dir crawl -threads 10 -depth 2

 there is not the slightest sign of any analysis plug-ins being called, even
 though hadoop.log clearly states that they are on and active:
 ---
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
 2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
 injected urls to crawl db entries.
 2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
 GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
 looking in: C:\cygwin\opt\nutch\plugins
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
 Auto-activation mode: [true]
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
 core extension points (nutch-extensionpoints)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
 Filter (query-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
 Analysers (lib-lucene-analyzers)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
 Normalizer (urlnormalizer-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
 Identification Parser/Filter (language-identifier)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
 Plug-in (parse-html)

 !
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
 Analysis Plug-in (analysis-ge)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
 Analysis Plug-in (analysis-de)
 !

 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Indexing Filter (index-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Summarizer Plug-in (summary-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
 Filter (query-site)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
 Framework (lib-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
 Plug-in (parse-text)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
 Filter (query-more)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Filter (urlfilter-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
 URL Normalizer (urlnormalizer-pass)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
 Plug-in (protocol-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Normalizer (urlnormalizer-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
 Plug-in (scoring-opic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
 HTML Parser (lib-nekohtml)
 2009-09-29 

Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin.
 You must add it to the nutch-site.xml file, in the
 <name>plugin.includes</name> section.

Ooops. It IS activated.

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.


Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
Any documentation on how to add this GUI to an existing Nutch instance?

პატივისცემით,
დავით ჯაში




2009/9/30 Bartosz Gadzimski bartek...@o2.pl:
 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is it possible to get a regex-urlfilter
 per instance (different for each instance)?

 Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

 Feature request: an option to merge segments, or maybe to remove old ones?

 Thanks,
 Bartosz



Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
Thanks,

Sorry for my bad English, I'll rephrase:

Can I add this GUI to existing Nutch installation? I've made some
modifications to mine, so starting from scratch would be quite
time-consuming.

პატივისცემით,
დავით ჯაში




On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:
 Hi David.
 Sorry, I don't understand your question. Documentation about the Nutch GUI
 can be found here:
 http://wiki.github.com/101tec/nutch

 marko




 On Sep 30, 2009, at 4:02 PM, David Jashi wrote:

 Any documentation on how to add this GUI to an existing Nutch instance?

 პატივისცემით,
 დავით ჯაში




 2009/9/30 Bartosz Gadzimski bartek...@o2.pl:

 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is it possible to get a regex-urlfilter
 per instance (different for each instance)?

 Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

 Feature request: an option to merge segments, or maybe to remove old ones?

 Thanks,
 Bartosz






Re: graphical user interface v0.2 for nutch

2009-09-30 Thread David Jashi
That's 1.0

Thanks a lot. I'll give it a try.
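
The diff-and-patch route Marko suggests below might look roughly like this,
assuming the patched tree is an svn checkout of the release-1.0 tag (otherwise
a plain diff -ru against a clean copy works too); the patch file name is just
an example:
---
# in the patched Nutch 1.0 working copy
svn diff > my-nutch-changes.patch

# in a fresh checkout of the nutch-gui fork
patch -p0 < my-nutch-changes.patch
---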

პატივისცემით,
დავით ჯაში




On Wed, Sep 30, 2009 at 18:37, Marko Bauhardt m...@101tec.com wrote:


 Sorry for my bad English, I'll rephrase:

 :) No Problem.



 Can I add this GUI to existing Nutch installation? I've made some
 modifications to mine, so starting from scratch would be quite
 time-consuming.

 Ah, OK, I understand. Hm. The GUI is forked from the release-1.0 tag. Which
 Nutch version have you patched?
 You can try to make a diff against release-1.0 to create a patch file. After
 that you can check out or download the GUI and try to apply your patch.
 Maybe this could work.



 marko





 პატივისცემით,
 დავით ჯაში




 On Wed, Sep 30, 2009 at 18:19, Marko Bauhardt m...@101tec.com wrote:

 Hi David.
 Sorry, I don't understand your question. Documentation about the Nutch GUI
 can be found here:
 http://wiki.github.com/101tec/nutch

 marko




 On Sep 30, 2009, at 4:02 PM, David Jashi wrote:

 Any documentation on how to add this GUI to an existing Nutch instance?

 პატივისცემით,
 დავით ჯაში




 2009/9/30 Bartosz Gadzimski bartek...@o2.pl:

 Hello,

 First - great job, it looks and works very nice.

 I have a question about urlfilters. Is it possible to get a regex-urlfilter
 per instance (different for each instance)?

 Also, what is the nutch-gui/conf/regex-urlfilter.txt file for?

 Feature request: an option to merge segments, or maybe to remove old ones?

 Thanks,
 Bartosz









Multilanguage support in Nutch 1.0

2009-09-29 Thread David Jashi
Hello, all.

I've got a bit of trouble with Nutch 1.0 and multilanguage support:

I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
analysis-de (German) and analysis-ge (Georgian).
Here are the innards of my seed file:
---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---
The first is Georgian, the other German. When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of any analysis plug-ins being called, even
though hadoop.log clearly states that they are on and active:
---
2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)

!
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
Analysis Plug-in (analysis-de)
!

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         JavaScript
Parser (parse-js)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
---

At the same time:

---
2009-09-29 16:39:54,406 INFO  lang.LanguageIdentifier - Language
identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO  lang.LanguageIdentifier - Language
identifier plugin supports: it(1000) is(1000) hu(1000) th(1000)
sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
nl(1000)
---

The language identifier itself, meanwhile, works like a charm:
---
$ bin/nutch plugin language-identifier
org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl
http://212.72.133.54/l/test.html
text was identified as ge
---
$ bin/nutch plugin language-identifier
org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl
http://212.72.133.54/l/de.html
text was identified as de
---

What could have possibly gone wrong?

პატივისცემით,
დავით ჯაში


Fwd: Release 1.0?

2009-02-03 Thread David Jashi
David Jashi wrote:

  Wow. Does it mean we'll have live indexing out of the box?

 If by live you mean that you can index a fetched & parsed segment, and
have it appear
 immediately in live search after you commit, then yes. Other than that,
Nutch still uses
 segments as a unit of work, so the segment generation / fetch / parsing /
updatedb etc. are
 still batch operations that take time.

Yes, that's what I meant. Very nice.
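
A rough sketch of that batch cycle on the 1.0 branch, ending with pushing the
freshly fetched and parsed segment into a running Solr instance. The paths and
the segment name are examples, and the solrindex argument order is as I recall
it from SolrIndexer, so double-check it against your build:
---
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/20090203120000
bin/nutch updatedb crawl/crawldb crawl/segments/20090203120000
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb \
  crawl/segments/20090203120000
---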


  By the way, is there any chance to modify stemming to process several
  wordforms (tokens) at once, and not one by one? That would really
  increase the speed of my external stemming.

 You can implement your own analyzer, which first caches all tokens from
TokenStream, and
 then passes them all at once to the external process.

Thanks for the hint, I'll dig in that direction.
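
A minimal sketch of the analyzer idea above, assuming the Lucene 2.x-era API
that Nutch 1.0 ships with (Token next(), setTermText()). The class name is
made up, and stemAll() is only a placeholder for the single batched HTTP call
to the external stemmer - here it just returns the terms unchanged so the
sketch stays runnable:
---
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class BatchStemFilter extends TokenFilter {

  private Iterator<Token> replay;

  public BatchStemFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    if (replay == null) {
      // 1. cache every token from the wrapped TokenStream
      List<Token> buffered = new ArrayList<Token>();
      for (Token t = input.next(); t != null; t = input.next()) {
        buffered.add(t);
      }
      // 2. one round trip to the external stemmer for all terms at once
      List<String> stems = stemAll(buffered);
      for (int i = 0; i < buffered.size(); i++) {
        buffered.get(i).setTermText(stems.get(i));
      }
      replay = buffered.iterator();
    }
    // 3. replay the stemmed tokens
    return replay.hasNext() ? replay.next() : null;
  }

  // Placeholder: a real version would POST all terms in one request and
  // parse the stems out of the response.
  private List<String> stemAll(List<Token> tokens) {
    List<String> stems = new ArrayList<String>(tokens.size());
    for (Token t : tokens) {
      stems.add(t.termText());
    }
    return stems;
  }
}
---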



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
da...@jashi.ge

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
da...@jashi.ge


Re: Release 1.0?

2009-02-02 Thread David Jashi
Wow. Does it mean we'll have live indexing out of the box?

By the way, is there any chance to modify stemming to process several
wordforms (tokens) at once, and not one by one? That would really
increase the speed of my external stemming.

2009/2/2 Tony Wang ivyt...@gmail.com:
 I definitely like the Nutch/Solr integration the best! Thanks guys!

-- 
with best regards,
David Jashi


Stemmer

2009-01-19 Thread David Jashi
Hello, everyone.

Is there any chance to make Nutch call the stemmer in batch? That is, give it
not a single word (token), but an array of words. My stemmer has external
parts, called via HTTP requests, so you can imagine what performance overhead
I have.

-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
da...@jashi.ge

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
da...@jashi.ge


Re: Nutch Training Seminar

2008-11-30 Thread David Jashi
If it's over the web - I'm in.

On Mon, Dec 1, 2008 at 3:23 AM, Windflying [EMAIL PROTECTED] wrote:
 I'm interested.

 Cheers,

 -Original Message-
 From: Dennis Kubes [mailto:[EMAIL PROTECTED]
 Sent: Monday, 1 December 2008 2:51 AM
 To: nutch-user@lucene.apache.org
 Subject: Re: Nutch Training Seminar

 Ok.  Seems like a lot of people are interested.  I will put something
 together and keep everyone up to date.  Thanks to everyone who responded.

 Dennis

 Dennis Kubes wrote:
 Would anybody be interested in a Nutch training seminar that goes over
 the following:

 1) Installing and configuration of Nutch
 2) Crawling the web, the CrawlDb, and URL filters
 3) Parsing and Parse filters
 4) Nutch plugins and plugin architecture
 5) Analysis, Link analysis, and scoring
 6) Indexing and custom fields
 7) Deployment, shard architecture
 8) Writing custom tools for Nutch
 9) Hadoop architecture

 Are there other things people would want to go over?

 Dennis





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: Language Analysis Plugins

2008-11-26 Thread David Jashi
That would be nice

On Wed, Nov 26, 2008 at 7:31 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
 For the 1.0 release, would everyone like the new language analysis plugins
 for different languages activated by default?  Currently the language
 analysis plugins are not activated by default.  We are adding 8 new
 languages and 7 new plugins.

 Dennis




-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: Lost regrading Stemming in nutch

2008-10-31 Thread David Jashi
I managed to connect Nutch 0.9 to my stemming machine. I don't know if
my approach would work on 0.8.1.

On Wed, Oct 29, 2008 at 10:56 PM, jcze [EMAIL PROTECTED] wrote:

 Hi, I'm using nutch 0.8.1. I'm lost about the stemming in nutch; I tried the
 wiki on MultiLingual Support because it said that it could stem the words...
 hmm... but I'm lost because it said that I need to modify the IndexSegment
 class, which I couldn't find. =(

 Anyway, I tried the stemming for nutch 8, but I'm lost again, so it didn't
 work either. =(

 I need some guidance and help... really, really lost =((
 --
 View this message in context: 
 http://www.nabble.com/Lost-regrading-Stemming-in-nutch-tp20233602p20233602.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: encoding

2008-09-29 Thread David Jashi
Everything is fine - instead of UTF-8, Nutch returns some 16-bit encoding.

It's OK, for some strange reason Nutch uses this encoding instead of
UTF-8. Text is displayed normally anyhow.

On Mon, Sep 29, 2008 at 1:04 PM, daut [EMAIL PROTECTED] wrote:

 hello,
 I've installed nutch-0.9 and made my first crawl. Then I did a search on the
 search page. Everything seems ok. I can see all result characters correctly
 (non-ASCII characters, Georgian language). But when I view the page source,
 instead of Georgian letters, for example პოლ, there are symbols such as:
 _#_4_3_1_8;_#_4_3_1_7;_#_4_3_1_4; (without the _ symbols :) ). Why does this
 happen? Is it normal?
 Best Rgds daut.


 --
 View this message in context: 
 http://www.nabble.com/encoding-tp19720443p19720443.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: encoding

2008-09-29 Thread David Jashi
It's definitely Tomcat. I just browsed through the
segments/*/content/part-*/data files with a hex viewer, and it looks like
Nutch uses some sort of compression.
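
If forcing UTF-8 on the Tomcat side is the question, the setting usually
suggested for the Nutch webapp is URIEncoding on the HTTP connector in
Tomcat's conf/server.xml. A sketch (the port and any other attributes are
whatever your install already has); note this mainly affects how query
parameters are decoded, so it may or may not be related to the escaping seen
in the page source:
---
<Connector port="8080" URIEncoding="UTF-8" />
---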

2008/9/29 daut [EMAIL PROTECTED]:

 I want to use utf-8. How can I force nutch to use utf-8? Or is it a Tomcat
 issue?


 David Jashi wrote:

 Everything is fine - instead of UTF-8, Nutch returns some 16-bit encoding.

 It's OK, for some strange reason Nutch uses this encoding instead of
 UTF-8. Text is displayed normally anyhow.

 On Mon, Sep 29, 2008 at 1:04 PM, daut [EMAIL PROTECTED] wrote:

 hello,
 I've installed nutch-0.9 and made my first crawl. Then I did a search on the
 search page. Everything seems ok. I can see all result characters correctly
 (non-ASCII characters, Georgian language). But when I view the page source,
 instead of Georgian letters, for example პოლ, there are symbols such as:
 _#_4_3_1_8;_#_4_3_1_7;_#_4_3_1_4; (without the _ symbols :) ). Why does this
 happen? Is it normal?
 Best Rgds daut.


 --
 View this message in context:
 http://www.nabble.com/encoding-tp19720443p19720443.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





 --
 with best regards,
 David Jashi
 Web development EO,
 Caucasus Online
 +995(32)970368
 [EMAIL PROTECTED]

 პატივისცემით,
 დავით ჯაში
 ვებ–განვითარების დირექტორი
 კავკასუს  ონლაინი
 +995(32)970368
 [EMAIL PROTECTED]



 --
 View this message in context: 
 http://www.nabble.com/encoding-tp19720443p19721356.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს  ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: pages with duplicate content in search results

2008-09-25 Thread David Jashi
Sorry for the off-topic question, but how do you make Nutch 0.9 search multiple indexes?

On Thu, Sep 25, 2008 at 4:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
 If you are using more than one index then dedup will not work across
 indexes.  A single index should dedup correctly unless the pages are not
 exact duplicates but near-duplicates.  The dedup process works on url and
 byte hash.  If the content is even 1 byte different, it doesn't work.

 Near-duplicate detection is another set of algorithms that hasn't been
 implemented in Nutch yet.  On the query side you can set hitsPerSite to
 1 and it should limit your search results.

 Dennis
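
For the hitsPerSite workaround Dennis mentions, the request would look roughly
like this (the parameter name is what the stock Nutch search.jsp uses, as far
as I remember; the hostname and query are placeholders):
---
http://hostname/nutch-0.9/search.jsp?query=teraterm&hitsPerSite=1
---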

 Edward Quick wrote:

 Hi,

 Even though I ran nutch dedup on my index, I still have pages with
 different urls but exactly the same content (see the search result example
 below). From what I read about dedup this shouldn't happen, though, as it
 deletes the url with the lowest score. Is there anything else I can try to
 get rid of these?

 Thanks,
 Ed.

 Item Document :- Client - TeraTerm Pro
 ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
 Online   Employee Self Service   ESS Home ... Description Document
 Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
 Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
 printing or keymapping is an issue, TeraTerm ...

 http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
 (cached) (explain) (anchors)



 Item Document :- Client - TeraTerm Pro
 ... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards
 Online   Employee Self Service   ESS Home ... Description Document
 Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix
 Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where
 printing or keymapping is an issue, TeraTerm ...

 http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument
 (cached) (explain) (anchors)
 _
 Make a mini you and download it into Windows Live Messenger
 http://clk.atdmt.com/UKM/go/111354029/direct/01/




-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: where to find the location of rss feed

2008-09-20 Thread David Jashi
Hello, Arun.

The easiest way to get the RSS file is to right-click the RSS link in
the lower right corner of the Nutch search results page and choose "Save
Target As..." in the pop-up menu. The address should be something like:

http://hostname/nutch-0.9/search.jsp?query=test&hitsPerPage=10&lang=en
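
The RSS link itself normally points at the OpenSearch servlet rather than
search.jsp, so the feed address should look something like the following
(path taken from the stock web.xml mapping; worth verifying on your install):
---
http://hostname/nutch-0.9/opensearch?query=test&hitsPerPage=10
---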

On Sat, Sep 20, 2008 at 7:37 AM, Arun Kamal [EMAIL PROTECTED] wrote:

 Hi all, I'm a newbie in nutch trying to see the way the rss is stored. I
 couldn't find a way of getting to the rss feed file. I want to see this to
 understand the way this rss is stored.
 Please help me.
 Thanks in advance,
 Arun Kamal
 --
 View this message in context: 
 http://www.nabble.com/where-to-find-the-location-of-rss-feed-tp19582613p19582613.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: Dedup

2008-09-19 Thread David Jashi
Thanks, Andrzej.

In fact I meant DD/MM/YYYY.

Anyway, knowing that dedup keeps the latest version of the file makes my
life a bit easier.

On Fri, Sep 19, 2008 at 12:35 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:

 Isn't he in fact NOT using the US date notation?  AFAIK, the US date
 notation is mm/dd/yyyy.

 Hehe, you are right - the joke is on me - the use of slashes misled me.

 Still my answer holds - dedup will keep just the latest version of the page.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Dedup

2008-09-18 Thread David Jashi
Hello, colleagues.

I have a theoretical question. Let's say:
on 01/01/2008 we crawled the page http://www.site.com/page.html
on 10/01/2008 the page changed
on 01/02/2008 we crawled it once again and merged the old and new indexes

Which version of this page will Nutch dedup leave in the index?
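
A rough sketch of that sequence with example paths - the index / dedup / merge
usage is as I recall it from Nutch 0.9 (Indexer, DeleteDuplicates, IndexMerger),
so double-check against bin/nutch's own usage messages:
---
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb crawl/segments/20080201*
bin/nutch dedup crawl/indexes crawl/indexes-new
bin/nutch merge crawl/index-merged crawl/indexes crawl/indexes-new
---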

-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Problems with highlighter

2008-09-12 Thread David Jashi
Hello,

I've implemented a GeorgianStemmer for Nutch, and even modified the source of
BasicQueryFilter.java to use stems for search. Here is the modified part of
the source code:

  private org.apache.lucene.search.Query
    exactPhrase(Phrase nutchPhrase,
                String field, float boost) {
    Term[] terms = nutchPhrase.getTerms();
    PhraseQuery exactPhrase = new PhraseQuery();
    for (int i = 0; i < terms.length; i++) {
      exactPhrase.add(luceneTerm(field, terms[i]));
    }
    exactPhrase.setBoost(boost);
    return exactPhrase;
  }

Everything works; there is only one little cloud in the blue sky of
my happiness: the damn Highlighter never works when I return results that
contain the searched words in a different form.

To make it clear (translating to English): if I look for "watery", it finds
strings containing "water" and "watery", but it highlights only those that
match the search criteria literally, i.e. "watery".

Any ideas?


Re: Problems with highlighter

2008-09-12 Thread David Jashi
("%D9", "\u10E6")
  .replaceAll("%DA", "\u10E7")
  .replaceAll("%DB", "\u10E8")
  .replaceAll("%DC", "\u10E9")
  .replaceAll("%DD", "\u10EA")
  .replaceAll("%DE", "\u10EB")
  .replaceAll("%DF", "\u10EC")
  .replaceAll("%E0", "\u10ED")
  .replaceAll("%E1", "\u10EE")
  .replaceAll("%E3", "\u10EF")
  .replaceAll("%E4", "\u10F0");
  return b;
}

private static String recodeEscIKELat( String term )
{
  String b = term
  .replaceAll("%C0", "a")
  .replaceAll("%C1", "b")
  .replaceAll("%C2", "g")
  .replaceAll("%C3", "d")
  .replaceAll("%C4", "e")
  .replaceAll("%C5", "v")
  .replaceAll("%C6", "z")
  .replaceAll("%C8", "T")
  .replaceAll("%C9", "i")
  .replaceAll("%CA", "k")
  .replaceAll("%CB", "l")
  .replaceAll("%CC", "m")
  .replaceAll("%CD", "n")
  .replaceAll("%CF", "o")
  .replaceAll("%D0", "p")
  .replaceAll("%D1", "J")
  .replaceAll("%D2", "r")
  .replaceAll("%D3", "s")
  .replaceAll("%D4", "t")
  .replaceAll("%D6", "u")
  .replaceAll("%D7", "f")
  .replaceAll("%D8", "q")
  .replaceAll("%D9", "R")
  .replaceAll("%DA", "y")
  .replaceAll("%DB", "S")
  .replaceAll("%DC", "C")
  .replaceAll("%DD", "c")
  .replaceAll("%DE", "Z")
  .replaceAll("%DF", "w")
  .replaceAll("%E0", "W")
  .replaceAll("%E1", "x")
  .replaceAll("%E3", "j")
  .replaceAll("%E4", "h");
  return b;
}

public GeorgianStemmer() {
}

/**
 * Stems the given term to a unique <tt>discriminator</tt>.
 *
 * @param term  The term that should be stemmed.
 * @return  Discriminator for <tt>term</tt>
 */
protected String stem( String term ) {
    String stem = term;
    String instring;
    try {
        // Create the URL of the external stemming service
        URL url = new URL("http://127.0.0.1:8042/?" + encodeIKE(term));
        URLConnection yc = url.openConnection();
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                    yc.getInputStream()));
        while ((instring = in.readLine()) != null)
        {
            stem = recodeEscIKE(instring);
        }
        in.close();
    }
    catch (MalformedURLException e)
    {
        return term;
    }
    catch (IOException e)
    {
        return term;
    }
    if ((stem != null) && (stem.charAt(0) != '_') && (stem.charAt(0) != 'U'))
        {return stem;}
    else
        {return term;}
}

  public static void main(String args[]) {

    GeorgianStemmer stemmer = new GeorgianStemmer();
    System.out.println(stemmer.stem(args[1]));

  }
}

On Fri, Sep 12, 2008 at 1:34 PM, Lyndon Maydwell [EMAIL PROTECTED] wrote:
 I'd be happy to try to trawl through the code for you :) I've been looking
 for stemming code that will run on 1.0 for ages now!




-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: intranet crawling

2008-09-04 Thread David Jashi
It may be a crude solution to that problem, but when I wanted ALL of my
video hosting site indexed, I simply generated a list like

http://tvali.ge/index.php?action=watch&v=495
http://tvali.ge/index.php?action=watch&v=496
http://tvali.ge/index.php?action=watch&v=497


from the MySQL table containing the list of posts and put it into the urls dir.
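
A sketch of how such a list can be generated in one go (the table and column
names here are made up - use whatever schema the site actually has, plus your
usual MySQL credentials):
---
mysql -N -e "SELECT CONCAT('http://tvali.ge/index.php?action=watch&v=', post_id) FROM posts" mydb > urls/seed.txt
---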

On Thu, Sep 4, 2008 at 6:56 PM, Edward Quick [EMAIL PROTECTED] wrote:

 Hi,

 I want to do an exhaustive scan of our intranet but running

 bin/nutch crawl urls -dir crawl -depth 9 -topN 50

 doesn't get everything. I've increased this now to

 bin/nutch crawl urls -dir crawl -depth 30 -topN 1000

 and it's certainly running longer but I'm not sure if this will still miss 
 any pages. Is there any way of doing this so I get an index of the whole 
 intranet?

 Thanks,

 Ed.

 _
 Win New York holidays with Kellogg's  Live Search
 http://clk.atdmt.com/UKM/go/111354033/direct/01/



-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]


Re: problems: crawling specific domain

2008-09-03 Thread David Jashi
Have you ever tried this one:
http://wiki.apache.org/nutch/Nutch_0%2e9_Crawl_Script_Tutorial ?
For a single-site crawl, see part 4 of:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm


On Wed, Sep 3, 2008 at 8:53 AM, Mohammad Monirul Hoque
[EMAIL PROTECTED] wrote:

 Hi,

 How can I crawl a specific domain only (like www.yellowpages.co.za)? What do
 I have to change to make things work correctly? I tried changing
 crawl-urlfilter.txt, and nutch started crawling outside my domain after
 some time.

 I am using nutch 0.9 in standalone mode (without hadoop). Can anyone give me
 some idea how to merge indexes from different crawls into a single index?

 Regards.
 --mohammad monirul hoque






-- 
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
[EMAIL PROTECTED]

პატივისცემით,
დავით ჯაში
ვებ–განვითარების დირექტორი
კავკასუს ონლაინი
+995(32)970368
[EMAIL PROTECTED]