Re: need your support

2010-01-20 Thread Mattmann, Chris A (388J)
Hi Sahar,

Can you post your:


 1.  crawl-urlfilter
 2.  nutch-site.xml

Also how are you running this program below?

I'm CC'ing nutch-user@ so the community can benefit from this thread.

Cheers,
Chris



On 1/20/10 1:42 PM, sahar elkazaz saharelka...@hotmail.com wrote:


Dear Sir,

I have followed all the steps in your article to run Nutch

and used this Java program to access the segments:

package nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Summary;
import org.apache.nutch.util.NutchConfiguration;

public class nutch {
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("animal", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("Total hits: " + hits.getTotal());
    int length = (int) Math.min(hits.getTotal(), 10);
    Hit[] show = hits.getHits(0, length);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);
    for (int i = 0; i < summaries.length; i++) {
      System.out.println(i + " " + details[i] + "\n" + summaries[i]);
    }
  }
}


and added the path of Nutch to the classpath,

but I receive these exceptions:
10/01/20 22:29:27 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at 
org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:89)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:77)
at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at 
org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at 
org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:50)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:77)
at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO searcher.SearchBean: opening indexes in crawl/indexes
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at 
org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at 
org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.IndexSearcher.init(IndexSearcher.java:59)
at 
org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:77)
at 
org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.init(NutchBean.java:102)
at 
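
A note on the LoginException above: on the Hadoop versions used by Nutch 1.0 (0.19/0.20), UnixUserGroupInformation.login fails when it cannot determine the local user and groups, a common symptom when running under Windows/Cygwin. One workaround is to set the hadoop.job.ugi property explicitly before constructing the NutchBean. The sketch below only illustrates the idea; treat the property name and its "user,group" value format as assumptions to verify against your Hadoop version.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class NutchBeanUgiWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Assumption: "hadoop.job.ugi" is the user/group property read by
    // UnixUserGroupInformation in Hadoop 0.19/0.20; its value is a
    // comma-separated "user,group" pair.
    conf.set("hadoop.job.ugi", "nutch,users");
    NutchBean bean = new NutchBean(conf);  // should no longer warn about failed logins
    // ... run searches as in the program above ...
  }
}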

Re: OR support

2009-12-14 Thread BrunoWL

Nobody?
Please, any answer would be good.
-- 
View this message in context: 
http://old.nabble.com/OR-support-tp26680899p26779229.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: OR support

2009-12-14 Thread Andrzej Bialecki

On 2009-12-14 16:05, BrunoWL wrote:


Nobody?
Please, any answer would be good.


Please check this issue:

https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as 
a patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



OR support

2009-12-07 Thread BrunoWL

Hi!
Did anybody add search with the OR operator in Nutch 1.0
successfully?
I found a patch for the 0.9 version, but it doesn't work.

thanks.
-- 
View this message in context: 
http://old.nabble.com/OR-support-tp26680899p26680899.html
Sent from the Nutch - User mailing list archive at Nabble.com.



support for robot rules that include a wild card

2009-11-19 Thread J.G.Konrad
I'm using nutch-1.0 and have noticed after running some tests that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means such a rule will not work as the person who wrote
the robots.txt file expected it to. For example

User-Agent: *
Disallow: /somepath/*/someotherpath

Even yahoo has one rule ( http://m.www.yahoo.com/robots.txt )
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
  Jason
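
For readers wondering what honouring such a rule would involve: a Disallow pattern containing '*' can be matched by translating the wildcard into a regular expression. The sketch below is not Nutch's RobotRulesParser, just a minimal, hypothetical illustration of the matching idea:

import java.util.regex.Pattern;

public class WildcardRuleMatcher {
  /** Translate a robots.txt path pattern ('*' = any run of characters) into a regex. */
  static Pattern compile(String disallowPattern) {
    StringBuilder regex = new StringBuilder("^");
    for (char c : disallowPattern.toCharArray()) {
      if (c == '*') {
        regex.append(".*");                               // wildcard: match anything
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));   // escape literal characters
      }
    }
    return Pattern.compile(regex.toString());
  }

  public static void main(String[] args) {
    Pattern rule = compile("/somepath/*/someotherpath");
    // prints true: this path would be disallowed under wildcard semantics
    System.out.println(rule.matcher("/somepath/x/y/someotherpath").find());
  }
}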


Re: support for robot rules that include a wild card

2009-11-19 Thread Ken Krugler

Hi Jason,

I've been spending some time on an improved robots.txt parser, as part  
of my Bixo project.


One aspect is support for Google wildcard extensions.

I think this will be part of the proposed crawler-commons project  
where we'll put components that can/should be shared between Nutch,  
Bixo, Heritrix and Droids.


One thing that would be useful is to collect examples of advanced  
robots.txt files, in addition to broken ones.


It would be great if you could open a Jira issue and attach specific  
examples of the above that you know about.


Thanks!

-- Ken


On Nov 19, 2009, at 11:31am, J.G.Konrad wrote:


I'm using nutch-1.0 and have noticed after running some tests that the
robot rules parser does not support wildcards (a.k.a. globbing) in
rules. This means such a rule will not work as the person who wrote
the robots.txt file expected it to. For example

User-Agent: *
Disallow: /somepath/*/someotherpath

Even yahoo has one rule ( http://m.www.yahoo.com/robots.txt )
User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

With the popularity of the wildcard (*) in robots.txt files these days
what are the plans/thoughts on adding support for it in Nutch?

Thanks,
 Jason



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin
 you must add it in the nutch-site.xml file in the
 plugin.includes section.

Shame on me! Thanks a lot.


 it's something like this:



 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
 </property>


 From: da...@jashi.ge
 Date: Tue, 29 Sep 2009 18:59:52 +0400
 Subject: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org

 Hello, all.

 I've got a bit of a trouble with Nutch 1.0 and multilanguage support:

 I have fresh install of Nutch and two analysis plugins I'd like to turn on:
 analysis-de (German) and analysis-ge (Georgian)
 Here are the innards of my seed file:
 ---
 http://212.72.133.54/l/test.html
 http://212.72.133.54/l/de.html
 ---
 The first is Georgian, other - German. When I run

 bin/nutch crawl seed -dir crawl -threads 10 -depth 2

 there is not a slightest sign of someone calling any analysis
 plug-ins, even though it's clearly stated in hadoop.log, that they are
 on and active:
 ---
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
 2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
 injected urls to crawl db entries.
 2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
 GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
 looking in: C:\cygwin\opt\nutch\plugins
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
 Auto-activation mode: [true]
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
 core extension points (nutch-extensionpoints)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
 Filter (query-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
 Analysers (lib-lucene-analyzers)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
 Normalizer (urlnormalizer-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
 Identification Parser/Filter (language-identifier)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
 Plug-in (parse-html)

 !
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
 Analysis Plug-in (analysis-ge)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
 Analysis Plug-in (analysis-de)
 !

 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Indexing Filter (index-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
 Summarizer Plug-in (summary-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
 Filter (query-site)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
 Framework (lib-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
 Plug-in (parse-text)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
 Filter (query-more)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Filter (urlfilter-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
 URL Normalizer (urlnormalizer-pass)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
 Plug-in (protocol-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
 Normalizer (urlnormalizer-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
 Plug-in (scoring-opic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
 HTML Parser (lib-nekohtml)
 2009-09-29 16:39

Re: Multilanguage support in Nutch 1.0

2009-09-30 Thread David Jashi
On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:

 hi

 try to activate the language-identifier plugin
 you must add it in the nutch-site.xml file in the
 plugin.includes section.

Ooops. It IS activated.

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
Language Identification Parser/Filter (language-identifier)

But fetched pages are not passed to it, as I reckon.


RE: Multilanguage support in Nutch 1.0

2009-09-30 Thread BELLINI ADAM

hi,
do you have some 'lang' metadata on the pages? Because the plugin first tries to
get the language from the metadata.
If you look at the Java source of the plugin, LanguageIndexingFilter.java:


// check if LANGUAGE found, possibly put there by HTMLLanguageParser
String lang = parse.getData().getParseMeta().get(Metadata.LANGUAGE);

// check if HTTP header tells us the language
if (lang == null) {
  lang = parse.getData().getContentMeta().get(Response.CONTENT_LANGUAGE);
}

Also try using Luke to check all your metadata in the index.
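
If you prefer to check programmatically instead of with Luke, the sketch below dumps the stored fields with the Lucene API of that era; it assumes a merged index at crawl/index and that the language-identifier plugin stores the language in a field named "lang" (adjust both to your setup).

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class DumpLangField {
  public static void main(String[] args) throws Exception {
    // Assumption: a merged Nutch index at crawl/index with stored "url" and "lang" fields.
    IndexReader reader = IndexReader.open(FSDirectory.getDirectory("crawl/index"));
    try {
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;               // skip deleted documents
        System.out.println(reader.document(i).get("url") + " -> "
            + reader.document(i).get("lang"));
      }
    } finally {
      reader.close();
    }
  }
}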





 From: da...@jashi.ge
 Date: Wed, 30 Sep 2009 17:22:26 +0400
 Subject: Re: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org
 
 On Wed, Sep 30, 2009 at 01:12, BELLINI ADAM mbel...@msn.com wrote:
 
  hi
 
  try to activate the language-identifier plugin
  you must add it in the nutch-site.xml file in the
  plugin.includes section.
 
 Ooops. It IS activated.
 
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -
 Language Identification Parser/Filter (language-identifier)
 
 But fetched pages are not passed to it, as I reckon.
  
_
Windows Live helps you keep up with all your friends, in one place.
http://go.microsoft.com/?linkid=9660826

Multilanguage support in Nutch 1.0

2009-09-29 Thread David Jashi
Hello, all.

I've got a bit of trouble with Nutch 1.0 and multilanguage support:

I have a fresh install of Nutch and two analysis plugins I'd like to turn on:
analysis-de (German) and analysis-ge (Georgian)
Here are the innards of my seed file:
---
http://212.72.133.54/l/test.html
http://212.72.133.54/l/de.html
---
The first is Georgian, the other German. When I run

bin/nutch crawl seed -dir crawl -threads 10 -depth 2

there is not the slightest sign of anything calling the analysis
plug-ins, even though hadoop.log clearly states that they are
on and active:
---
2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.
2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
looking in: C:\cygwin\opt\nutch\plugins
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         the nutch
core extension points (nutch-extensionpoints)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic Query
Filter (query-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Lucene
Analysers (lib-lucene-analyzers)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Language
Identification Parser/Filter (language-identifier)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)

!
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Georgian
Analysis Plug-in (analysis-ge)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         German
Analysis Plug-in (analysis-de)
!

2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Indexing Filter (index-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Basic
Summarizer Plug-in (summary-basic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Site Query
Filter (query-site)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         HTTP
Framework (lib-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Text Parse
Plug-in (parse-text)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         More Query
Filter (query-more)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter (urlfilter-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Pass-through
URL Normalizer (urlnormalizer-pass)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         CyberNeko
HTML Parser (lib-nekohtml)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         JavaScript
Parser (parse-js)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         URL Query
Filter (query-url)
2009-09-29 16:39:15,671 INFO  plugin.PluginRepository -         Regex URL
Filter Framework (lib-regex-filter)
---

At the same time:

---
2009-09-29 16:39:54,406 INFO  lang.LanguageIdentifier - Language
identifier configuration [1-4/2048]
2009-09-29 16:39:54,609 INFO  lang.LanguageIdentifier - Language
identifier plugin supports: it(1000) is(1000) hu(1000) th(1000)
sv(1000) ge(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000)
el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000)
nl(1000)
---

At the same time, the language identifier itself works like a charm:
---
$ bin/nutch plugin language-identifier
org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl
http://212.72.133.54/l/test.html
text was identified as ge
---
$ bin/nutch plugin language-identifier
org.apache.nutch.analysis.lang.LanguageIdentifier -identifyurl
http://212.72.133.54/l/de.html
text was identified as de
---

What could have possibly gone wrong?

Respectfully,
David Jashi


RE: Multilanguage support in Nutch 1.0

2009-09-29 Thread BELLINI ADAM

hi 

Try to activate the language-identifier plugin.
You must add it to the nutch-site.xml file in the plugin.includes
section.

It's something like this:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


 From: da...@jashi.ge
 Date: Tue, 29 Sep 2009 18:59:52 +0400
 Subject: Multilanguage support in Nutch 1.0
 To: nutch-user@lucene.apache.org
 
 Hello, all.
 
 I've got a bit of a trouble with Nutch 1.0 and multilanguage support:
 
 I have fresh install of Nutch and two analysis plugins I'd like to turn on:
 analysis-de (German) and analysis-ge (Georgian)
 Here are the innards of my seed file:
 ---
 http://212.72.133.54/l/test.html
 http://212.72.133.54/l/de.html
 ---
 The first is Georgian, other - German. When I run
 
 bin/nutch crawl seed -dir crawl -threads 10 -depth 2
 
 there is not a slightest sign of someone calling any analysis
 plug-ins, even though it's clearly stated in hadoop.log, that they are
 on and active:
 ---
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - crawl started in: crawl
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - rootUrlDir = seed
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - threads = 10
 2009-09-29 16:39:13,328 INFO  crawl.Crawl - depth = 2
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: starting
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2009-09-29 16:39:13,375 INFO  crawl.Injector - Injector: urlDir: seed
 2009-09-29 16:39:13,390 INFO  crawl.Injector - Injector: Converting
 injected urls to crawl db entries.
 2009-09-29 16:39:13,421 WARN  mapred.JobClient - Use
 GenericOptionsParser for parsing the arguments. Applications should
 implement Tool for the same.
 2009-09-29 16:39:15,546 INFO  plugin.PluginRepository - Plugins:
 looking in: C:\cygwin\opt\nutch\plugins
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Plugin
 Auto-activation mode: [true]
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Registered Plugins:
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - the nutch
 core extension points (nutch-extensionpoints)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic Query
 Filter (query-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Lucene
 Analysers (lib-lucene-analyzers)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic URL
 Normalizer (urlnormalizer-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Language
 Identification Parser/Filter (language-identifier)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Html Parse
 Plug-in (parse-html)
 
 !
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Georgian
 Analysis Plug-in (analysis-ge)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - German
 Analysis Plug-in (analysis-de)
 !
 
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic
 Indexing Filter (index-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Basic
 Summarizer Plug-in (summary-basic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Site Query
 Filter (query-site)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - HTTP
 Framework (lib-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Text Parse
 Plug-in (parse-text)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - More Query
 Filter (query-more)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Regex URL
 Filter (urlfilter-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Pass-through
 URL Normalizer (urlnormalizer-pass)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Http Protocol
 Plug-in (protocol-http)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - Regex URL
 Normalizer (urlnormalizer-regex)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - OPIC Scoring
 Plug-in (scoring-opic)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - CyberNeko
 HTML Parser (lib-nekohtml)
 2009-09-29 16:39:15,671 INFO  plugin.PluginRepository - JavaScript
 Parser (parse-js)
 2009-09-29 16:39

Development support

2009-07-28 Thread Koch Martina
Hi,

we're looking for a Nutch developer to implement some plugins for us in the 
next few weeks.
Substantial knowledge in Nutch, Java and Databases is needed.

If you're interested, please contact me (koch at huberverlag dot de).

Thanks in advance,

Martina



Re: Support needed

2009-07-28 Thread Sudhi Seshachala
As a very old Nutch user and developer of plugins who has even implemented Nutch
in some products, I could help you.
I am based in Houston, Texas -- Skype me at hooduku.

sudhi

--- On Mon, 7/27/09, sf30098 sf30...@yahoo.com wrote:

From: sf30098 sf30...@yahoo.com
Subject: Support needed
To: nutch-user@lucene.apache.org
Date: Monday, July 27, 2009, 4:01 PM


I need someone with substantial knowledge of Nutch, Java and Lucene who has
customised the system before. In particular, this should be related to image
indexing and geo-positioning,

if possible (either one alone is fine as well).

The job role will be to provide support and advice on how to go about
implementing such a system.

This includes:
1. answering questions and providing guidance on implementation
2. reviewing code and providing suggestions on how to improve it.

Please let me know if you're interested.
-- 
View this message in context: 
http://www.nabble.com/Support-needed-tp24688172p24688172.html
Sent from the Nutch - User mailing list archive at Nabble.com.




  

Support needed

2009-07-27 Thread sf30098

I need someone with substantial knowledge of Nutch, Java and Lucene who has
customised the system before. In particular, this should be related to image
indexing and geo-positioning,

if possible (either one alone is fine as well).

The job role will be to provide support and advice on how to go about
implementing such a system.

This includes:
1. answering questions and providing guidance on implementation
2. reviewing code and providing suggestions on how to improve it.

Please let me know if you're interested.
-- 
View this message in context: 
http://www.nabble.com/Support-needed-tp24688172p24688172.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Multi-Lingual Support in Nutch

2009-04-13 Thread Kunal Wku
Hello,

I am using Nutch 0.9. I would like to enable multi-lingual support in our 
existing system. I read the article on Multi-Lingual Support in Nutch by Jérôme 
Charron, but it is about previous versions of Nutch. I included the plugin 
in nutch-site.xml as analysis-es. What are the other steps to follow to 
enable multi-lingual support?

Thanks & Regards,
Kunal



  

Professional Nutch Support and Distribution

2009-03-17 Thread Dennis Kubes
Wanted to gauge community interest in having a certified Nutch 
distribution with support?  Similar to what Lucid Imagination is doing 
for Solr and Lucene and what Cloudera is providing for Hadoop.  Anybody 
interested?


Dennis


Re: Professional Nutch Support and Distribution

2009-03-17 Thread Marc Boucher
This sounds interesting. I might be interested in this.

Marc Boucher
http://hyperix.com

On Tue, Mar 17, 2009 at 12:31 PM, Dennis Kubes ku...@apache.org wrote:
 Wanted to gauge community interest in having a certified Nutch distribution
 with support?  Similar to what Lucid Imagination is doing for Solr and
 Lucene and what Cloudera is providing for Hadoop.  Anybody interested?

 Dennis



Does Nutch support the boolean OR operator in a search query?

2009-01-19 Thread M S Ram

Hi,

Does Nutch support the boolean OR operator (or something similar) in a 
search query? I mean is there any class already available to do this? 
The Nutch search interface doesn't seem to have this option.


Expected functionality: if I ask it to search for (Post Graduate) OR 
(Masters), it should fetch the pages which contain at least one of 
{Post Graduate, Masters}.


Thank you,
Ram.


Re: Does Nutch support the boolean OR operator in a search query?

2009-01-19 Thread Doğacan Güney
Hi,

On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote:
 Hi,

 Does Nutch support the boolean OR operator (or something similar) in a
 search query? I mean is there any class already available to do this? The
 Nutch search interface doesn't seem to have this option.

 Expected functionality: If I ask it to search for (Post Graduate) OR
 (Masters), it should fetch the pages which contain at least one of {Post
 Graduate, Masters}.


Unfortunately no.

There is an issue with a patch

https://issues.apache.org/jira/browse/NUTCH-479

but nothing happened for a while.

 Thank you,
 Ram.




-- 
Doğacan Güney


Re: Does Nutch support the boolean OR operator in a search query?

2009-01-19 Thread M S Ram
Oh! That's sad! :( What is the best approach to provide an OR search 
now? Should I go down to Lucene? Does Lucene understand HDFS? Please 
help me with the appropriate guidelines.


Thank you,
Ram

Doğacan Güney wrote:

Hi,

On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote:
  

Hi,

Does Nutch support the boolean OR operator (or something similar) in a
search query? I mean is there any class already available to do this? The
Nutch search interface doesn't seem to have this option.

Expected functionality: If I ask it to search for (Post Graduate) OR
(Masters), it should fetch the pages which contain at least one of {Post
Graduate, Masters}.




Unfortunately no.

There is an issue with a patch

https://issues.apache.org/jira/browse/NUTCH-479

but nothing happened for a while.

  

Thank you,
Ram.






  




Re: Does Nutch support the boolean OR operator in a search query?

2009-01-19 Thread Lyndon Maydwell
Lucene has support for OR queries, so it should be possible to do it,
but support for this in Nutch isn't available as far as I know. I'd
also be interested if anyone has managed to implement this.

On Tue, Jan 20, 2009 at 1:50 AM, M S Ram ms...@cse.iitk.ac.in wrote:
 Oh! That's sad! :( What is the best approach to provide an OR search now?
 Should I go down to Lucene? Does Lucene understand HDFS? Please help me with
 the appropriate guidelines.

 Thank you,
 Ram

 Doğacan Güney wrote:

 Hi,

 On Mon, Jan 19, 2009 at 4:02 PM, M S Ram ms...@cse.iitk.ac.in wrote:


 Hi,

 Does Nutch support the boolean OR operator (or something similar) in a
 search query? I mean is there any class already available to do this? The
 Nutch search interface doesn't seem to have this option.

 Expected functionality: If I ask it to search for (Post Graduate) OR
 (Masters), it should fetch the pages which contain at least one of {Post
 Graduate, Masters}.



 Unfortunately no.

 There is an issue with a patch

 https://issues.apache.org/jira/browse/NUTCH-479

 but nothing happened for a while.



 Thank you,
 Ram.
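
Since Lucene itself does support OR, here is roughly what such a query looks like when issued against the Lucene index directly, outside Nutch's query syntax. This is only a sketch using era-appropriate Lucene 2.x calls; the field name "content" and the index path crawl/index are assumptions.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LuceneOrQueryExample {
  public static void main(String[] args) throws Exception {
    // OR semantics = a BooleanQuery whose clauses are all SHOULD.
    BooleanQuery or = new BooleanQuery();
    or.add(new TermQuery(new Term("content", "masters")), BooleanClause.Occur.SHOULD);
    or.add(new TermQuery(new Term("content", "graduate")), BooleanClause.Occur.SHOULD);

    IndexSearcher searcher = new IndexSearcher("crawl/index");
    Hits hits = searcher.search(or);
    System.out.println("matches: " + hits.length());
    searcher.close();
  }
}

This does not integrate with NutchBean or distributed search, which is exactly the gap NUTCH-479 tries to close.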










does nutch support crawling cold fusion pages?

2008-12-08 Thread Alex Basa
Hi,

Does anyone know if there is a plugin for ColdFusion pages or if they are 
supported?  I'm trying to crawl

http://www.knowitall.org/naturalstate

Thanks in advance,

Alex


  


What kind of searches does Nutch support?

2008-05-04 Thread Miao Liqiang NCS
What kind of searches does Nutch support?



Missing zh.ngp for zh locate support for language Identifier

2008-03-15 Thread Vinci

Hi all,

I found that zh.ngp for the zh locale is missing. I have seen this file in a
screenshot, but googling the filename returned nothing for me... can
anyone provide this file for me?

Thank you
-- 
View this message in context: 
http://www.nabble.com/Missing-zh.ngp-for-zh-locate-support-for-language-Identifier-tp16068532p16068532.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Support Hardware and OS for nutch and hadoop

2008-01-04 Thread Developer Developer
Hello Friends,

I am gathering information on supported hardware and OS for Nutch and Hadoop.
I did not find any conclusive information by going through the Nutch wiki.

If I want to build a cluster of nodes using Nutch/Hadoop for crawling, then
what are my options for H/W and OS?


Prefix Query in Nutch and Wildcard support.

2008-01-03 Thread Developer Developer
Hello Friends,

Is there any way to do a prefix query in Nutch? E.g. query the content field for
occurrences of abc*? I could do it in Lucene, but I want to do it in
Nutch. Going through the mailing list it appeared that Nutch does not
support such queries. Is that true?

Thanks !
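
For reference, on the plain Lucene side this is a one-liner with PrefixQuery; the open question above is only about exposing it through Nutch's own query syntax. A minimal sketch, assuming a local index at crawl/index and a field named "content":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;

public class LucenePrefixQueryExample {
  public static void main(String[] args) throws Exception {
    // Matches every document whose "content" field contains a term starting with "abc".
    PrefixQuery query = new PrefixQuery(new Term("content", "abc"));

    IndexSearcher searcher = new IndexSearcher("crawl/index");
    Hits hits = searcher.search(query);
    System.out.println("matches: " + hits.length());
    searcher.close();
  }
}

Note that prefix and wildcard queries can expand to very many terms, which is one of the performance reasons Nutch's query syntax deliberately leaves them out (see the NUTCH-479 thread later in this digest).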


Re: NUTCH-479 Support for OR queries - what is this about

2007-07-09 Thread Briggs

Thanks for the answer. That was helpful.

I was sooo wrong.

On 7/7/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Briggs wrote:
 Please keep this thread going as I am also curious to know why this
 has been 'forked'.   I am sure that most of this lies within the
 original OPIC filter but I still can't understand why straight forward
 lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters
(which were added much later).

The decision to use a different query syntax than the one from Lucene
was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations,
which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's
possible to use Nutch with different back-end searcher implementations.
This started to materialize only now, with the ongoing effort to use
Solr as a possible backend.

* to limit the query syntax to those queries that provide best tradeoff
between functionality and performance, in a large-scale search engine.


 On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

 Ok, so I guess what I don't understand is what is the Nutch query
 syntax?

Query syntax is defined in an informal way on the Help page in
nutch.war, or here:

http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from
org.apache.nutch.analysis.NutchAnalysis.jj.




 The main discussion I found on nutch-user is this:
 http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
 I was wondering why the query syntax is so limited.
 There are no OR queries, there are no fielded queries,
 or fuzzy, or approximate... Why? The underlying index
 supports all these operations.


Actually, it's possible to configure Nutch to allow raw field queries -
you need to add a raw field query plugin for this. Please see
RawFieldQueryFilter class, and existing plugins that use fielded
queries: query-site, and query-more. Query-more / DateQueryFilter is
especially interesting, because it shows how to use raw token values
from a parsed query to build complex Lucene queries.



 I notice by looking at the or.patch file
 (https://issues.apache.org/jira/secure/attachment/12360659/or.patch)
 that one of the programs under consideration is:
 nutch/searcher/Query.java
 The code for this is distinct from
 lucene/search/Query.java


See above - they are completely different classes, with completely
different purpose. The use of the same class name is unfortunate and
misleading.

Nutch Query class is intended to express queries entered by search
engine users, in a tokenized and parsed way, so that the rest of Nutch
may deal with Clauses, Terms and Phrases instead of plain String-s.

On the other hand, Lucene Query is intended to express arbitrarily
complex Lucene queries - many of these queries would be prohibitively
expensive for a large search engine (e.g. wildcard queries).



 It looks like this is an architecture issue that I don't understand.
 If nutch is an extension of lucene, why does it define a different
 Query class?

Nutch is NOT an extension of Lucene. It's an application that uses
Lucene as a library.


  Why don't we just use the Lucene code to query the
 indexes?  Does this have something to do with the nutch webapp
 (nutch.war)?  What is the historical genesis of this issue (or is that
 even relevant)?

Nutch webapp doesn't have anything to do with it. The limitations in the
query syntax have different roots (see above).

--
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
Conscious decisions by conscious minds are what make reality real


Re: NUTCH-479 Support for OR queries - what is this about

2007-07-07 Thread Briggs

Please keep this thread going as I am also curious to know why this
has been 'forked'. I am sure that most of this lies within the
original OPIC filter, but I still can't understand why straightforward
Lucene queries have not been used within the application.



On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

I've been reading up on NUTCH-479 Support for OR queries but I must be 
missing something obvious because I don't understand what the JIRA is about:

https://issues.apache.org/jira/browse/NUTCH-479

   Description:
   There have been many requests from users to extend Nutch query syntax

   to add support for OR queries,
   in addition to the implicit AND and NOT
queries supported now.

Ok, so I guess what I don't understand is what is the Nutch query syntax?

The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.

I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

It looks like this is an architecture issue that I don't understand.  If nutch is an 
extension of lucene, why does it define a different Query class?  Why don't 
we just use the Lucene code to query the indexes?  Does this have something to do with 
the nutch webapp (nutch.war)?  What is the historical genesis of this issue (or is that 
even relevant)?










--
Conscious decisions by conscious minds are what make reality real


Re: NUTCH-479 Support for OR queries - what is this about

2007-07-07 Thread Andrzej Bialecki

Briggs wrote:

Please keep this thread going as I am also curious to know why this
has been 'forked'.   I am sure that most of this lies within the
original OPIC filter but I still can't understand why straight forward
lucene queries have not been used within the application.


No, this has actually almost nothing to do with the scoring filters 
(which were added much later).


The decision to use a different query syntax than the one from Lucene 
was motivated by a few reasons:


* to avoid the need to support low-level index and searcher operations, 
which the Lucene API would require us to implement.


* to keep the Nutch core largely independent of Lucene, so that it's 
possible to use Nutch with different back-end searcher implementations. 
This started to materialize only now, with the ongoing effort to use 
Solr as a possible backend.


* to limit the query syntax to those queries that provide best tradeoff 
between functionality and performance, in a large-scale search engine.




On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:


Ok, so I guess what I don't understand is what is the Nutch query 
syntax?


Query syntax is defined in an informal way on the Help page in 
nutch.war, or here:


http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from 
org.apache.nutch.analysis.NutchAnalysis.jj.





The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.


Actually, it's possible to configure Nutch to allow raw field queries - 
you need to add a raw field query plugin for this. Please see 
RawFieldQueryFilter class, and existing plugins that use fielded 
queries: query-site, and query-more. Query-more / DateQueryFilter is 
especially interesting, because it shows how to use raw token values 
from a parsed query to build complex Lucene queries.





I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) 
that one of the programs under consideration is:

nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java


See above - they are completely different classes, with completely 
different purpose. The use of the same class name is unfortunate and 
misleading.


Nutch Query class is intended to express queries entered by search 
engine users, in a tokenized and parsed way, so that the rest of Nutch 
may deal with Clauses, Terms and Phrases instead of plain String-s.


On the other hand, Lucene Query is intended to express arbitrarily 
complex Lucene queries - many of these queries would be prohibitively 
expensive for a large search engine (e.g. wildcard queries).





It looks like this is an architecture issue that I don't understand.  
If nutch is an extension of lucene, why does it define a different 
Query class?


Nutch is NOT an extension of Lucene. It's an application that uses 
Lucene as a library.



 Why don't we just use the Lucene code to query the 
indexes?  Does this have something to do with the nutch webapp 
(nutch.war)?  What is the historical genesis of this issue (or is that 
even relevant)?


Nutch webapp doesn't have anything to do with it. The limitations in the 
query syntax have different roots (see above).
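
To make the distinction concrete, here is a minimal side-by-side sketch (the Nutch call mirrors the NutchBean code earlier in this digest; the Lucene field name "content" is an assumption, and the Lucene calls are era-appropriate 2.x API):

import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.nutch.util.NutchConfiguration;

public class QueryClassesSideBySide {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();

    // Nutch Query: restricted user-level syntax, parsed into Clauses/Terms/Phrases.
    org.apache.nutch.searcher.Query nutchQuery =
        org.apache.nutch.searcher.Query.parse("apache nutch", conf);

    // Lucene Query: arbitrarily complex, built by Lucene's own parser.
    org.apache.lucene.search.Query luceneQuery =
        new QueryParser("content", new StandardAnalyzer()).parse("apache OR nutch");

    System.out.println("Nutch:  " + nutchQuery);
    System.out.println("Lucene: " + luceneQuery);
  }
}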


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



NUTCH-479 Support for OR queries - what is this about

2007-07-06 Thread Kai_testing Middleton
I've been reading up on NUTCH-479 Support for OR queries but I must be 
missing something obvious because I don't understand what the JIRA is about:

https://issues.apache.org/jira/browse/NUTCH-479

   Description:
   There have been many requests from users to extend Nutch query syntax

   to add support for OR queries, 
   in addition to the implicit AND and NOT
queries supported now.

Ok, so I guess what I don't understand is what is the Nutch query syntax? 

The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.

I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

It looks like this is an architecture issue that I don't understand.  If nutch 
is an extension of lucene, why does it define a different Query class?  Why 
don't we just use the Lucene code to query the indexes?  Does this have 
something to do with the nutch webapp (nutch.war)?  What is the historical 
genesis of this issue (or is that even relevant)?





 


How best to add sponsored link support..??

2006-12-19 Thread RP

Hi all,

I've been tasked with looking into this and am not a coder - that said, 
Nutch  is doing great and the bean counters have asked me to look into 
adding sponsored link results and I'm wondering how best to add this.


It would be nice to utilize the Nutch engine to come up with the pages 
versus just doing a lookup on words and results in a flat file but the 
key word data could change daily (hourly) and would need to be able to 
be hand entered (or automated) as people sign up (re-index is not really 
an option).  I'm not sure this would fly within the main Nutch segments 
and index, but I could see maybe a separate index or possibly adding a 
flag to the existing data but I've not seen any easy to use tools to 
change/update/insert records into what is already there (yes Luke on the 
index but that does not touch the segment data, right?).  I don't want 
to change existing searched data and I don't see an issue with having 
duplicate results (sponsored up top and existing entry down below 
somewhere) but it would be more elegant to not have that occur.  I also 
see issues in a simple flat file look up as a multiple word search is 
best handled inside Nutch to score the results versus having to do 
something similar in the sponsored results.  I can see the need to 
control the summary text displayed and also pass thru any codes in the 
URL which are currently being stripped during the main crawl/index 
cycle.  I also see issues with seriously customizing the internals as 
they would have to be maintained as Nutch itself is updated


If anyone has looked at this and has at least some ideas on how best to 
do this let me know.  I need to come up with a preliminary estimate 
before I can engage and pay the coders to make this happen so if there 
are any easy or best practices ways on doing this any help/pointers 
would be appreciated


--
rp





Re: How best to add sponsored link support..??

2006-12-19 Thread RP
Let me qualify this - ad banner rotation is dealt with - I'm looking for 
something that will use our Nutch engine to serve up relevant links from 
people who pay for that privilege.  We do not want to serve up ad's from 
someone else's system i.e. the big G or Y, but use our own Nutch search 
results to serve up relevant paying links that we have sold and 
maintain.   In a simple relational SQL world we would add a flag and 
another table with the links and scores and look that up and pass back 
when needed.  Problem with that is that we lose the whole multi word 
scoring capability in Nutch i.e. pizza beer Chicago, should serve up a 
Chicago pizza ad first and beer ads further down, just like our search 
results have relevancy (not a great example but you get the idea). 
Re-writing a scoring engine to do that in SQL seems like a waste when 
Nutch already does it just fine.


So in a nutshell - we need to do what the big G and Y and other do when 
serving up key word based sponsor links.  My thought - automate the 
build of a dummy page with the key words bought that would be indexed 
and served up just like regular crawled and indexed pages, using the 
scoring to rank them in terms of relevancy and placement - I have not 
seen any snippets of code to do simple insert/update/delete operations 
on a Nutch segment or index however


This is the idea gathering phase - think like a school/college search 
engine with local paying advertisers - we want to serve those links up 
to the searchers to help offset the cost of the service and serve up or 
flag links that rank first because of payment followed by normal search 
link results


rp

Sean Dean wrote:

I might be totally off base with what your asking to do, but take a look at 
this open source project: http://phpadsnew.com/two/.
 
Its basically an advertising engine, built on PHP. Integration within any application is a breeze, and it supports external advertising such as Google Ads.


Sean

- Original Message 
From: RP [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, December 19, 2006 10:52:56 AM
Subject: How best to add sponsored link support..??


Hi all,

I've been tasked with looking into this and am not a coder - that said, 
Nutch  is doing great and the bean counters have asked me to look into 
adding sponsored link results and I'm wondering how best to add this.


It would be nice to utilize the Nutch engine to come up with the pages 
versus just doing a lookup on words and results in a flat file but the 
key word data could change daily (hourly) and would need to be able to 
be hand entered (or automated) as people sign up (re-index is not really 
an option).  I'm not sure this would fly within the main Nutch segments 
and index, but I could see maybe a separate index or possibly adding a 
flag to the existing data but I've not seen any easy to use tools to 
change/update/insert records into what is already there (yes Luke on the 
index but that does not touch the segment data, right?).  I don't want 
to change existing searched data and I don't see an issue with having 
duplicate results (sponsored up top and existing entry down below 
somewhere) but it would be more elegant to not have that occur.  I also 
see issues in a simple flat file look up as a multiple word search is 
best handled inside Nutch to score the results versus having to do 
something similar in the sponsored results.  I can see the need to 
control the summary text displayed and also pass thru any codes in the 
URL which are currently being stripped during the main crawl/index 
cycle.  I also see issues with seriously customizing the internals as 
they would have to be maintained as Nutch itself is updated


If anyone has looked at this and has at least some ideas on how best to 
do this let me know.  I need to come up with a preliminary estimate 
before I can engage and pay the coders to make this happen so if there 
are any easy or best practices ways on doing this any help/pointers 
would be appreciated


  




Re: How best to add sponsored link support..??

2006-12-19 Thread Sami Siren

Are you looking for something like the Google keymatch feature as described in [1],
which was then more or less mimicked in the Nutch web2 module [2],
and since released, at least as a lookalike, on Google Code [3]?

--
Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2]
http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/
[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java

2006/12/19, RP [EMAIL PROTECTED]:


Let me qualify this - ad banner rotation is dealt with - I'm looking for
something that will use our Nutch engine to serve up relevant links from
people who pay for that privilege.  We do not want to serve up ad's from
someone else's system i.e. the big G or Y, but use our own Nutch search
results to serve up relevant paying links that we have sold and
maintain.   In a simple relational SQL world we would add a flag and
another table with the links and scores and look that up and pass back
when needed.  Problem with that is that we lose the whole multi word
scoring capability in Nutch i.e. pizza beer Chicago, should serve up a
Chicago pizza ad first and beer ads further down, just like our search
results have relevancy (not a great example but you get the idea).
Re-writing a scoring engine to do that in SQL seems like a waste when
Nutch already does it just fine.

So in a nutshell - we need to do what the big G and Y and other do when
serving up key word based sponsor links.  My thought - automate the
build of a dummy page with the key words bought that would be indexed
and served up just like regular crawled and indexed pages, using the
scoring to rank them in terms of relevancy and placement - I have not
seen any snippets of code to do simple insert/update/delete operations
on a Nutch segment or index however

This is the idea gathering phase - think like a school/college search
engine with local paying advertisers - we want to serve those links up
to the searchers to help offset the cost of the service and serve up or
flag links that rank first because of payment followed by normal search
link results

rp

Sean Dean wrote:
 I might be totally off base with what your asking to do, but take a look
at this open source project: http://phpadsnew.com/two/.

 Its basically an advertising engine, built on PHP. Integration within
any application is a breeze, and it supports external advertising such as
Google Ads.

 Sean

 - Original Message 
 From: RP [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Tuesday, December 19, 2006 10:52:56 AM
 Subject: How best to add sponsored link support..??


 Hi all,

 I've been tasked with looking into this and am not a coder - that said,
 Nutch  is doing great and the bean counters have asked me to look into
 adding sponsored link results and I'm wondering how best to add this.

 It would be nice to utilize the Nutch engine to come up with the pages
 versus just doing a lookup on words and results in a flat file but the
 key word data could change daily (hourly) and would need to be able to
 be hand entered (or automated) as people sign up (re-index is not really
 an option).  I'm not sure this would fly within the main Nutch segments
 and index, but I could see maybe a separate index or possibly adding a
 flag to the existing data but I've not seen any easy to use tools to
 change/update/insert records into what is already there (yes Luke on the
 index but that does not touch the segment data, right?).  I don't want
 to change existing searched data and I don't see an issue with having
 duplicate results (sponsored up top and existing entry down below
 somewhere) but it would be more elegant to not have that occur.  I also
 see issues in a simple flat file look up as a multiple word search is
 best handled inside Nutch to score the results versus having to do
 something similar in the sponsored results.  I can see the need to
 control the summary text displayed and also pass thru any codes in the
 URL which are currently being stripped during the main crawl/index
 cycle.  I also see issues with seriously customizing the internals as
 they would have to be maintained as Nutch itself is updated

 If anyone has looked at this and has at least some ideas on how best to
 do this let me know.  I need to come up with a preliminary estimate
 before I can engage and pay the coders to make this happen so if there
 are any easy or best practices ways on doing this any help/pointers
 would be appreciated






Re: How best to add sponsored link support..??

2006-12-19 Thread RP

Thanks Sami,

This is closer, from an initial look - does this do anything on the 
backend (i.e. defining the data flags so we can get a match) as well, or 
do we need to build that?


Sami Siren wrote:
Are you looking for something like the google keymatch as described in 
[1]

which was then more or less mimiced in nutch web2 module[1],
and since also atleast as a lookalike released in google code [3]

--
Sami Siren

[1] http://www.google.com/enterprise/mini/end_user_features.html
[2]
http://svn.apache.org/viewvc/lucene/nutch/trunk/contrib/web2/plugins/web-keymatch/ 


[3] http://custom-keymatch-onebox.googlecode.com/svn/trunk/Keymatch.java

2006/12/19, RP [EMAIL PROTECTED]:


Let me qualify this - ad banner rotation is dealt with - I'm looking for
something that will use our Nutch engine to serve up relevant links from
people who pay for that privilege.  We do not want to serve up ad's from
someone else's system i.e. the big G or Y, but use our own Nutch search
results to serve up relevant paying links that we have sold and
maintain.   In a simple relational SQL world we would add a flag and
another table with the links and scores and look that up and pass back
when needed.  Problem with that is that we lose the whole multi word
scoring capability in Nutch i.e. pizza beer Chicago, should serve up a
Chicago pizza ad first and beer ads further down, just like our search
results have relevancy (not a great example but you get the idea).
Re-writing a scoring engine to do that in SQL seems like a waste when
Nutch already does it just fine.

So in a nutshell - we need to do what the big G and Y and other do when
serving up key word based sponsor links.  My thought - automate the
build of a dummy page with the key words bought that would be indexed
and served up just like regular crawled and indexed pages, using the
scoring to rank them in terms of relevancy and placement - I have not
seen any snippets of code to do simple insert/update/delete operations
on a Nutch segment or index however

This is the idea gathering phase - think like a school/college search
engine with local paying advertisers - we want to serve those links up
to the searchers to help offset the cost of the service and serve up or
flag links that rank first because of payment followed by normal search
link results

rp

Sean Dean wrote:
 I might be totally off base with what you're asking to do, but take a
 look at this open source project: http://phpadsnew.com/two/.

 It's basically an advertising engine, built on PHP. Integration within
 any application is a breeze, and it supports external advertising such as
 Google Ads.

 Sean

 - Original Message 
 From: RP [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Tuesday, December 19, 2006 10:52:56 AM
 Subject: How best to add sponsored link support..??


 Hi all,

 I've been tasked with looking into this and am not a coder - that said,
 Nutch is doing great and the bean counters have asked me to look into
 adding sponsored link results, and I'm wondering how best to add this.

 It would be nice to utilize the Nutch engine to come up with the pages
 versus just doing a lookup on words and results in a flat file, but the
 keyword data could change daily (or hourly) and would need to be able to
 be hand entered (or automated) as people sign up (re-indexing is not really
 an option).  I'm not sure this would fly within the main Nutch segments
 and index, but I could see maybe a separate index, or possibly adding a
 flag to the existing data, but I've not seen any easy-to-use tools to
 change/update/insert records into what is already there (yes, Luke on the
 index, but that does not touch the segment data, right?).  I don't want
 to change existing searched data, and I don't see a problem with having
 duplicate results (the sponsored entry up top and the existing entry further
 down), though it would be more elegant not to have that occur.  I also
 see issues with a simple flat-file lookup, since a multi-word search is
 best scored inside Nutch rather than re-implementing something similar
 for the sponsored results.  I can see the need to control the summary
 text displayed and also to pass through any codes in the URL which are
 currently being stripped during the main crawl/index cycle.  Finally, I
 see issues with seriously customizing the internals, as those changes
 would have to be maintained as Nutch itself is updated.

 If anyone has looked at this and has at least some ideas on how best to
 do it, let me know.  I need to come up with a preliminary estimate
 before I can engage and pay the coders to make this happen, so any
 help or pointers toward easy or best-practice ways of doing this would
 be appreciated.








--
rp




Re: Lucene query support in Nutch

2006-10-10 Thread Stefan Neufeind
Cristina Belderrain wrote:
 On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote:
 
 This is *exactly* what I was thinking. Like Stefan, I believe the
 nutch analyzer is a good foundation and should therefore be extended
 to support the or operator, and possibly additional capabilities
 when the need arises.

 t.n.a.
 
 Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
 which does exactly what you want, is already there?

From what I have understood so far in this thread, the Nutch
analyser/query layer seems to be more targeted and provides
additional features regarding distributed search, as well as maybe
speed improvements due to its nature, etc. (Correct me if I'm wrong.)

One idea that has come up was to offer both as alternatives, so you could
use Lucene-based queries if you need their features on the one hand, but
can live with the restrictions on the other.

However, from what has been mentioned so far it seems that
Lucene queries by default can only be on the document content (is that
right?), not e.g. site:www.example.org. Hmm ...


PS: Thank you all for the help offered so far in this thread on how to get
Lucene queries going. Unfortunately I couldn't make much use of just
simply extending it here and there ... :-(


Regards,
 Stefan


Re: Lucene query support in Nutch

2006-10-10 Thread Tomi NA

2006/10/10, Cristina Belderrain [EMAIL PROTECTED]:

On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote:

 This is *exactly* what I was thinking. Like Stefan, I believe the
 nutch analyzer is a good foundation and should therefore be extended
 to support the or operator, and possibly additional capabilities
 when the need arises.

 t.n.a.

Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
which does exactly what you want, is already there?


Stefan basically answered that question, but my opinion is
that Nutch's analyzer does its job well and only lacks one obvious
query capability: the OR search. The fact that several users here
need this kind of functionality suggests it's not the beginning of a
landslide of newly required capabilities. Lucene's analyzer, on the
other hand, is completely inadequate in this respect if search is
necessarily bound to a single (content) field.
In conclusion, my position is pragmatic: I welcome the simplest
solution that implements the OR search. I just believe that it'd be
easiest to do that by extending the Nutch analyzer.

t.n.a.


Re: Lucene query support in Nutch

2006-10-10 Thread Bill Goffe
Tomi said:

 In conclusion, my position is pragmatic: I welcome the simplest
 solution to implement the or search. I just believe that it'd be
 easiest to do that extending the nutch Analyzer.

This seems like a very reasonable approach. I too would very much like
OR. It would also be nice if it worked in 0.7.2 and I could drop it in,
but that may be asking for too much.

 - Bill

-- 
 *--*
 | Bill Goffe [EMAIL PROTECTED]  |
 | Department of Economicsvoice: (315) 312-3444 |
 | SUNY Oswegofax:   (315) 312-5444 |
 | 416 Mahar Hall http://cook.rfe.org |  
 | Oswego, NY  13126|
**--*---*
| Been there. Done that.  |
|   -- Ed Viesturs as he looked up Mount Everest. He climbed it five times, |
|  twice without oxygen. He now plans to be the first American to scale |
|  all of the world's 8,000 meter mountains. Climber for the Ages Has  |
|  Next Peak in View, New York Times, 2/13/00. |
*---*



Re: Lucene query support in Nutch

2006-10-09 Thread Tomi NA

2006/10/8, Stefan Neufeind [EMAIL PROTECTED]:


if it's not the full feature-set, maybe most people could live with it.
But basic boolean queries I think were the root for this topic. Is there
an easier way to allow this in Nutch as well instead of throwing quite
a bit away and using the Lucene-syntax? As has just been pointed out: It


This is *exactly* what I was thinking. Like Stefan, I believe the
nutch analyzer is a good foundation and should therefore be extended
to support the or operator, and possibly additional capabilities
when the need arises.

t.n.a.


Re: Lucene query support in Nutch

2006-10-09 Thread Cristina Belderrain

On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote:


This is *exactly* what I was thinking. Like Stefan, I believe the
nutch analyzer is a good foundation and should therefore be extended
to support the or operator, and possibly additional capabilities
when the need arises.

t.n.a.


Tomi, why would you extend Nutch's analyzer when Lucene's analyzer,
which does exactly what you want, is already there?

Regards,

Cristina


Re: Lucene query support in Nutch

2006-10-07 Thread Cristina Belderrain

Hello,

I just would like to confirm that the version of the search() method
shown in the previous post works fine, at least regarding boolean
queries. Anyway, I see no reason why it wouldn't work with any other
Lucene query (fuzzy, proximity, etc.).

Now, please be warned that the inclusion of this new method in
IndexSearcher has quite an impact on some other classes: besides
NutchBean, where you'll need to add the wrapper methods that will
allow its use there, you'll also need to add the new method signature
to the Searcher interface, which is implemented by IndexSearcher.

Since DistributedSearch implements the Searcher interface as well,
you'll need to provide a method with the new signature there too.
Besides, depending on your needs, Summarizer and Query will demand
some changes in order to preserve phrases (composite search terms)
when they are highlighted in the summary.
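
In code, the extra signature being described would look roughly like the sketch
below. It only mirrors the parameter list of the existing search() method quoted
elsewhere in this thread, so treat it as illustrative rather than as the exact
interface:

  import java.io.IOException;
  import org.apache.nutch.searcher.Hits;
  import org.apache.nutch.searcher.Query;

  // Sketch only: the raw-query overload added next to the existing method.
  // Both IndexSearcher and DistributedSearch would then have to implement it.
  public interface Searcher {

    // existing signature, as used by the search() shown earlier in the thread
    Hits search(Query query, int numHits,
                String dedupField, String sortField, boolean reverse) throws IOException;

    // added: accept an untouched Lucene query string instead of a Nutch Query
    Hits search(String luceneQueryString, int numHits,
                String dedupField, String sortField, boolean reverse) throws IOException;
  }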

Let me remind you that all this must be done just to provide something
that's already there: Nutch is built on top of Lucene, after all. If
it's hard to understand why Lucene's capabilities were simply
neutralized in Nutch, it's even harder to figure out why no choice was
left to users by means of some configuration file.

Regards,

Cristina


Re: Lucene query support in Nutch

2006-10-07 Thread Björn Wilmsmann

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

Am 07.10.2006 um 17:40 schrieb Cristina Belderrain:


Let me remind you that all this must be done just to provide something
that's already there: Nutch is built on top of Lucene, after all. If
it's hard to understand why Lucene's capabilities were simply
neutralized in Nutch, it's even harder to figure out why no choice was
left to users by means of some configuration file.


I think this issue is rooted in the underlying philosophy of Nutch:
Nutch was designed with the idea of a possible Google-sized (and the
like) crawler and indexer in mind. Regular expressions and wildcard
queries do not seem to fit into this philosophy, as such queries
would be far less efficient on a huge data set than simple boolean
queries.


Nevertheless, I agree that there should be an option to choose the
Lucene query engine instead of the Nutch-flavoured one, because Nutch
has proven equally suitable for areas which do not require queries as
efficient as an all-out web indexing application does (intranet
crawling, for instance).


- --
Best regards,
Björn Wilmsmann


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFFJ+75gz0R1bg11MERAgT7AJ4mPRF8Z0BR2yLCm5Pxsz4VvtTI6QCfcS8b
q8gM8LQapjAloNIRwNV+osE=
=v7Lf
-END PGP SIGNATURE-


Re: Lucene query support in Nutch

2006-10-07 Thread Sami Siren


Nevertheless, I agree that there should be an option to choose the 
Lucene query engine instead of the Nutch flavour one because Nutch has 
been proven to be equally suitable for areas which do not require as 
efficient queries (like intranet crawling for instance) as an all-out 
web indexing application.


I agree also. Different query parsers could perhaps be made pluggable, or 
at least configurable. The current (or similar) implementation could be the 
default one offered, and by configuration one could switch it to an 
"intranet" mode.
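
A configuration switch along those lines could be as small as the following
sketch; the property name is invented for illustration and is not an existing
Nutch setting:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  // Sketch only: pick the query parser based on a (hypothetical) configuration flag.
  public class QueryParserSwitch {

    public static void demo(String raw) throws Exception {
      Configuration conf = NutchConfiguration.create();
      // "searcher.query.lucene" is an invented property name, used here for illustration.
      if (conf.getBoolean("searcher.query.lucene", false)) {
        // hand the untouched string to Lucene's parser, as in Ravi's IndexSearcher change
        org.apache.lucene.search.Query luceneQuery =
            new org.apache.lucene.queryParser.QueryParser("content",
                new org.apache.lucene.analysis.standard.StandardAnalyzer()).parse(raw);
        System.out.println("Lucene parse: " + luceneQuery);
      } else {
        // the normal Nutch path, with NutchAnalysis and the query filter plugins
        org.apache.nutch.searcher.Query nutchQuery =
            org.apache.nutch.searcher.Query.parse(raw, conf);
        System.out.println("Nutch parse: " + nutchQuery);
      }
    }
  }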


Contributions anyone?

--
 Sami Siren


Re: Lucene query support in Nutch

2006-10-07 Thread Stefan Neufeind
Björn Wilmsmann wrote:
 
 Am 07.10.2006 um 17:40 schrieb Cristina Belderrain:
 
 Let me remind you that all this must be done just to provide something
 that's already there: Nutch is built on top of Lucene, after all. If
 it's hard to understand why Lucene's capabilities were simply
 neutralized in Nutch, it's even harder to figure out why no choice was
 left to users by means of some configuration file.
 
 I think this issue is rooted in the underlying philosophy of Nutch:
 Nutch was designed with the idea of a possible Google(and the
 likes)-sized crawler and indexer in mind. Regular expressions and
 wildcard queries do not seem to fit into this philosophy, as such
 queries would be way less efficient on a huge data set than simple
 boolean queries.
 
 Nevertheless, I agree that there should be an option to choose the
 Lucene query engine instead of the Nutch flavour one because Nutch has
 been proven to be equally suitable for areas which do not require as
 efficient queries (like intranet crawling for instance) as an all-out
 web indexing application.

Hi,

if it's not the full feature set, maybe most people could live with it.
But basic boolean queries were, I think, the root of this topic. Is there
an easier way to allow those in Nutch as well, instead of throwing quite
a bit away and using the Lucene syntax? As has just been pointed out, it
seems quite a few things need to be changed to use a Lucene search
instead of a Nutch search. I don't think that's needed in most cases,
but I see several situations where a boolean query would make sense.

(Currently I fetch up to 10,000 or so results using OpenSearch and
filter them in a script myself, since no AND (site:... or site:...) is
yet possible.)


Regards,
 Stefan


Re: Lucene query support in Nutch

2006-10-05 Thread Stefan Neufeind
Hi,

yes, I guess having the full strength of Lucene-based queries would be
nice. That would also solve the boolean-queries question I had a few
days ago :-)

Ravi, doesn't Lucene also allow querying other fields? Is there any
possibility of adding that capability to your proposal?


In general: what is the advantage of the current Nutch parser over
the Lucene-based one?


Regards,
 Stefan

Ravi Chintakunta wrote:
 Hi Cristina,
 
 You can achieve this by modifying the IndexSearcher to take the query
 String as an argument and then use
 
 org.apache.lucene.queryParser.QueryParser's parse(String ) method to
 parse the query string. The modified method in IndexSearcher would
 look as below:
 
 public Hits search(String queryString, int numHits,
     String dedupField, String sortField, boolean reverse) throws IOException {
 
   org.apache.lucene.queryParser.QueryParser parser =
       new org.apache.lucene.queryParser.QueryParser("content",
           new org.apache.lucene.analysis.standard.StandardAnalyzer());
 
   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);
 
   return translateHits(
       optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
       dedupField, sortField);
 }
 
 For this you have to modify the code in search.jsp and NutchBean too,
 so that you are passing on the raw query string to IndexSearcher.
 
 Note that with this approach, you are limiting the search to the content
 field.
 
 
 - Ravi Chintakunta
 
 
 
 On 10/4/06, Cristina Belderrain [EMAIL PROTECTED] wrote:
 Hello,

 we all know that Lucene supports, among others, boolean queries. Even
 though Nutch is built on Lucene, boolean clauses are removed by Nutch
 filters so boolean queries end up as flat queries where terms are
 implicitly connected by an OR operator, as far as I can see.

 Is there any simple way to turn off the filtering so a boolean query
 remains as such after it is submitted to Nutch?

 Just in case a simple way doesn't exist, Ravi Chintakunta suggests the
 following workaround:

 We have to modify the analyzer and add more plugins to Nutch
 to use the Lucene's query syntax. Or we have to directly use
 Lucene's Query Parser. I tried the second approach by modifying
 org.apache.nutch.searcher.IndexSearcher and that seems to work.

 Can anyone please elaborate on what Ravi actually means by modifying
 org.apache.nutch.searcher.IndexSearcher? Which methods are supposed
 to be modified and how?

 It would be really nice to know how to do this. I believe many other
 Nutch users would also benefit from an answer to this question.

 Thanks so much,

 Cristina


Re: Lucene query support in Nutch

2006-10-05 Thread Björn Wilmsmann

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi everybody,


On 05/10/2006 05:44 Ravi Chintakunta wrote:


public Hits search(String queryString, int numHits,
    String dedupField, String sortField, boolean reverse) throws IOException {

  org.apache.lucene.queryParser.QueryParser parser =
      new org.apache.lucene.queryParser.QueryParser("content",
          new org.apache.lucene.analysis.standard.StandardAnalyzer());

  org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);

  return translateHits(
      optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
      dedupField, sortField);
}


This seems to be a good approach. I have not yet tried it out in
detail; however, the optimize() method in LuceneQueryOptimizer only
takes a BooleanQuery as an argument, so the line 'return
translateHits...' would cause a compile error, wouldn't it?



- --
Best regards,
Björn Wilmsmann


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFFJV9Fgz0R1bg11MERAt3sAJ4pKJ8voEhWSo+94SI6bam4iVPYgACbBQmm
sFAZIcCv3CoIBJC5g8FbOyo=
=vzdw
-END PGP SIGNATURE-


Re: Lucene query support in Nutch

2006-10-05 Thread Cristina Belderrain

Hi Björn,

yes, the error you point out will happen indeed... A possible
workaround would be:

   public Hits search(String queryString, int numHits,
       String dedupField, String sortField, boolean reverse)
       throws IOException {

     org.apache.lucene.queryParser.QueryParser parser =
         new org.apache.lucene.queryParser.QueryParser("content",
             new org.apache.lucene.analysis.standard.StandardAnalyzer());

     org.apache.lucene.search.Query luceneQuery = null;
     try {
       luceneQuery = parser.parse(queryString);
     } catch (Exception ex) {
     }

     org.apache.lucene.search.BooleanQuery boolQuery =
         new org.apache.lucene.search.BooleanQuery();
     boolQuery.add(luceneQuery,
         org.apache.lucene.search.BooleanClause.Occur.MUST);
     return translateHits(
         optimizer.optimize(boolQuery, luceneSearcher, numHits,
             sortField, reverse),
         dedupField, sortField);
   }

Please notice that I'm not sure this will work as it should: right
now, it just compiles... I still need to modify the NutchBean class so
it can pass on the raw query, as Ravi says.
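
For reference, the NutchBean side of that change is only a thin wrapper that hands
the untouched string down to the modified IndexSearcher. A sketch, assuming NutchBean
keeps its Searcher in a field named searcher, as the rest of this thread implies:

  // Added to NutchBean (sketch only): bypass NutchAnalysis and the query filters
  // and pass the raw Lucene query string straight down to IndexSearcher.
  public Hits search(String rawLuceneQuery, int numHits,
      String dedupField, String sortField, boolean reverse) throws IOException {
    return searcher.search(rawLuceneQuery, numHits, dedupField, sortField, reverse);
  }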

Regards,

Cristina


On 10/5/06, Björn Wilmsmann [EMAIL PROTECTED] wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi everybody,


On 05/10/2006 05:44 Ravi Chintakunta wrote:

 public Hits search(String queryString, int numHits,
     String dedupField, String sortField, boolean reverse) throws IOException {

   org.apache.lucene.queryParser.QueryParser parser =
       new org.apache.lucene.queryParser.QueryParser("content",
           new org.apache.lucene.analysis.standard.StandardAnalyzer());

   org.apache.lucene.search.Query luceneQuery = parser.parse(queryString);

   return translateHits(
       optimizer.optimize(luceneQuery, luceneSearcher, numHits, sortField, reverse),
       dedupField, sortField);
 }

This seems to be a good approach. I have not yet tried it out in
detail; however, the optimize() method in LuceneQueryOptimizer only
takes a BooleanQuery as an argument, so the line 'return
translateHits...' would cause a compile error, wouldn't it?


- --
Best regards,
Björn Wilmsmann


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.1 (Darwin)

iD8DBQFFJV9Fgz0R1bg11MERAt3sAJ4pKJ8voEhWSo+94SI6bam4iVPYgACbBQmm
sFAZIcCv3CoIBJC5g8FbOyo=
=vzdw
-END PGP SIGNATURE-


OpenOffice Support?

2006-07-11 Thread Matthew Holt
Just wondering, has anyone done any work on a plugin (or is aware of a
plugin) that supports indexing OpenOffice documents? Thanks.

Matt


Re: OpenOffice Support?

2006-07-11 Thread Lourival Júnior

Taking advantage of your question: does anyone know if version 0.7.2 of Nutch
supports the zip plugin? If so, where can I find it?

Lourival Junior

On 7/11/06, Matthew Holt [EMAIL PROTECTED] wrote:


Just wondering, has anyone done any work on a plugin (or aware of a
plugin) that supports the indexing of open office documents? Thanks.
Matt





--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]


Re: Add Wyona to the wiki support page?

2006-06-21 Thread Andrzej Bialecki

Renaud Richardet wrote:

Hello Nutch,

My name is Renaud Richardet and I am the COO of Wyona LLC.  We are 
offering Nutch and Lucene support (http://wyona.com/lucene.html), and 
I was wondering if I could add our company to 
http://wiki.apache.org/nutch/Support. That would be great.


Certainly, you can add a short note about your company on the support 
page. It's a Wiki, so you can just create an account, log in, and edit 
this page (please use the preview button to check the changes before 
saving).


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
The funny thing about that wiki page (and some others in that area) is 
that they apparently use the nofollow tags.  Given the topic of that 
wiki, isn't that a bit odd?  I personally dislike the nofollow tag and 
think it should be used only in extreme circumstances (i.e. here's a 
link to a site you absolutely don't want to visit).  I believe in this 
case however it's simply being used so that sites that are listed don't 
get any pagerank/weight/whatever passed to them from an authority site.  
A really bizarre policy for a search related site IMO.


Swinging back on topic, does nutch obey the nofollow tags?

g.




Andrzej Bialecki wrote:


Renaud Richardet wrote:


Hello Nutch,

My name is Renaud Richardet and I am the COO of Wyona LLC.  We are 
offering Nutch and Lucene support (http://wyona.com/lucene.html), and 
I was wondering if I could add our company to 
http://wiki.apache.org/nutch/Support. That would be great.



Certainly, you can add a short note about your company on the support 
page. It's a Wiki, so you can just create an account, log in, and edit 
this page (please use the preview button to check the changes before 
saving).




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Andrzej Bialecki

Insurance Squared Inc. wrote:
The funny thing about that wiki page (and some others in that area) is 
that they apparently use the nofollow tags.  Given the topic of that 
wiki, isn't that a bit odd?  I personally dislike the nofollow tag and 
think it should be used only in extreme circumstances (i.e. here's a 
link to a site you absolutely don't want to visit).  I believe in this 
case however it's simply being used so that sites that are listed 
don't get any pagerank/weight/whatever passed to them from an 
authority site.  A really bizarre policy for a search related site IMO.




I think it's a default setting for the Wiki, which nobody bothered to 
change...



Swinging back on topic, does nutch obey the nofollow tags?


Yes. Please see HtmlParser and HTMLMetaTags classes for details.
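
In code terms the check is roughly the sketch below; the getNoFollow() accessor is
assumed from the HTMLMetaTags class mentioned above, so verify it against your Nutch
version:

  import org.apache.nutch.parse.HTMLMetaTags;

  // Sketch only: how parsing code consults the robots meta directives before
  // collecting outlinks. getNoFollow() is assumed from the HTMLMetaTags class.
  public class NoFollowCheck {
    static boolean mayFollow(HTMLMetaTags metaTags) {
      // true unless the page declared <meta name="robots" content="nofollow">
      return !metaTags.getNoFollow();
    }
  }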

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
Well, so much for knee-jerk suspicions as to intent.  No need to look for
conspiracy theories when default settings are more likely to be the
cause.  That should probably be a corollary to Occam's razor or something :).



Andrzej Bialecki wrote:


Insurance Squared Inc. wrote:

The funny thing about that wiki page (and some others in that area) 
is that they apparently use the nofollow tags.  Given the topic of 
that wiki, isn't that a bit odd?  I personally dislike the nofollow 
tag and think it should be used only in extreme circumstances (i.e. 
here's a link to a site you absolutely don't want to visit).  I 
believe in this case however it's simply being used so that sites 
that are listed don't get any pagerank/weight/whatever passed to them 
from an authority site.  A really bizarre policy for a search related 
site IMO.




I think it's a default setting for the Wiki, which nobody bothered to 
change...



Swinging back on topic, does nutch obey the nofollow tags?



Yes. Please see HtmlParser and HTMLMetaTags classes for details.



Re: Full fledged Lucene Query Syntax support in Nutch

2006-05-04 Thread Ravi Chintakunta

Performance might be a reason, but only the queries that include
wildcards or fuzzy characters would be slowed down, not all
queries, right? The performance of regular plain-text searches shouldn't
be affected.

Any thoughts?

Thanks,
Ravi Chintakunta

On 5/3/06, Ravish Bhagdev [EMAIL PROTECTED] wrote:

The reason is performance.  Allowing the above means more complex queries, which
cause more delay in getting results.  If you need these features, you know how to
get them, but it's a tradeoff with performance.  Maybe not if the number of pages
is small, but it will matter at a large scale.

-- Ravish.


On 5/2/06, Ravi Chintakunta [EMAIL PROTECTED] wrote:

 Lucene supports fuzzy, wildcard, range, proximity searches as listed
 here: http://lucene.apache.org/java/docs/queryparsersyntax.html

 But Nutch does not use all these capabilities. It is limited by query
 parsing in org.apache.nutch.analysis.NutchAnalysis and the query
 filters hosted in plugins.

 We have to modify the analyzer and add more plugins to Nutch to use
 Lucene's query syntax. Or we have to directly use Lucene's Query
 Parser. I tried the second approach by modifying
 org.apache.nutch.searcher.IndexSearcher and that seems to work.

 Is there a reason that Nutch does not support the entire Lucene query
 syntax by default?

 Thanks in advance,
 Ravi Chintakunta





Full fledged Lucene Query Syntax support in Nutch

2006-05-02 Thread Ravi Chintakunta

Lucene supports fuzzy, wildcard, range, proximity searches as listed
here: http://lucene.apache.org/java/docs/queryparsersyntax.html

But Nutch does not use all these capabilities. It is limited by query
parsing in org.apache.nutch.analysis.NutchAnalysis and the query
filters hosted in plugins.

We have to modify the analyzer and add more plugins to Nutch to use
Lucene's query syntax. Or we have to directly use Lucene's Query
Parser. I tried the second approach by modifying
org.apache.nutch.searcher.IndexSearcher and that seems to work.

Is there a reason that Nutch does not support the entire Lucene query
syntax by default?

Thanks in advance,
Ravi Chintakunta


Re: Full fledged Lucene Query Syntax support in Nutch

2006-05-02 Thread Herman Hardenbol
Sorry, I am on holiday until the 8th of May.

Please contact the [EMAIL PROTECTED] for urgent matters.

Kind regards, Herman.



HTTPS support?

2006-03-06 Thread David Odmark

Hi,

Does Nutch 0.8 support https fetches? If not, are there any active 
efforts to support it?


TIA,

David Odmark



Re: HTTPS support?

2006-03-06 Thread Andrzej Bialecki

David Odmark wrote:

Hi,

Does Nutch 0.8 support https fetches? If not, are there any active 
efforts to support it?


It does, using protocol-httpclient plugin.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Nutch doesn't support Korean?

2006-03-03 Thread Teruhiko Kurosaka
I was browsing NutchAnalysis.jj and found that
Hangul syllables (U+AC00 ... U+D7AF, where U+XXXX denotes
the Unicode character with hex value XXXX) are not
part of the LETTER or CJK class.  This suggests to me that
Nutch cannot handle Korean documents at all.

Is anybody successfully using Nutch for Korean?

-kuro


Re: Nutch doesn't support Korean?

2006-03-03 Thread Cheolgoo Kang
Hello,

There was a similar issue with Lucene's StandardTokenizer.jj.

http://issues.apache.org/jira/browse/LUCENE-444

and

http://issues.apache.org/jira/browse/LUCENE-461

I have almost no experience with Nutch, but you can handle it the same
way as in those issues above.


On 3/4/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
 I was browing NutchAnalysis.jj and found that
 Hungul Syllables (U+AC00 ... U+D7AF; U+ means
 a Unicode character of the hex value ) are not
 part of LETTER or CJK class.  This seems to me that
 Nutch cannot handle Korean documents at all.

 Is anybody successfully using Nutch for Korean?

 -kuro



--
Cheolgoo


xquery support for nutch

2006-02-20 Thread Raghavendra Prabhu
Hi

It would be great if we provided XQuery support in Nutch,

where expressions like 3 + 4 = 7 would be evaluated.

http://www.xml.com/pub/a/2002/10/16/xquery.html

It is just an idea, but it would probably make Nutch a more universal tool.


Rgds
Prabhu


Single NutchBean and multiple indices support

2006-02-15 Thread Jack Tang
Hi there.

I am facing the same question and looking for the same solution.
Your solution seems easy :) My question is: what file system does the
application run on?
LocalFileSystem or DistributedFileSystem?

Thanks
/Jack

On 2/9/06, Ravi Chintakunta [EMAIL PROTECTED] wrote:
 Hi David,

 Thanks for your reply.

  After posting the question, I have done this in a more optimal way.

 - I used only a single NutchBean and modified it so that the search
 method takes the indices being searched as an argument. This single
 NutchBean creates separate IndexReaders on the merged indices in the
 directories and keeps them in a map.

 - Based on the indexes that are searched, NutchBean creates an
 IndexSearcher using the appropriate IndexReaders. I have added a
 constructor to IndexSearcher that takes an array of IndexReaders and
 uses a MultiReader to initialize itself.

 - The NutchBean creates a single FetchedSegments with the combination
 of the segments directories in all the directories.

 The advantages with this are:

 - A single IndexReader for an index - so no additional filehandles are 
 created.
 - No opening / closing of readers or segments - this improves performance.


 - Ravi Chintakunta
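
A sketch of the extra constructor described above, as it might be added to
org.apache.nutch.searcher.IndexSearcher; the luceneSearcher field name is taken
from code quoted elsewhere in this archive, so treat it as illustrative:

  // Sketch only: combine the per-site index readers into one Lucene searcher.
  public IndexSearcher(org.apache.lucene.index.IndexReader[] readers)
      throws java.io.IOException {
    // merge the per-site readers into a single view of all indexes
    org.apache.lucene.index.IndexReader merged =
        new org.apache.lucene.index.MultiReader(readers);
    this.luceneSearcher = new org.apache.lucene.search.IndexSearcher(merged);
    // the existing constructors would still set up the optimizer, segments, conf, etc.
  }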


  This is almost exactly what I've done.  I create a new NutchBean for
  each search, and point it at whichever of 9 subdirectories the user has
  selected; because I really don't want 511 (2^9-1) beans hanging around.
 
  The reason for the too many open files is that the NutchBean doesn't
  clean up after itself - I guess because for most people, the NutchBean
  is going to be reused.
 
  I added a close() method to FetchedSegments.Segment in my installation,
  to close all the readers.  I added a closeSegments() method to
  NutchBean, to call close() on each segment that's been opened.  Then I
  call closeSegments() after each search.
 
  I realise that NutchBean really wasn't designed to support being
  instantiated once per search, but I don't care.  It works well, and
  performance is not an issue.
 
  Regards,
  David.
 
 
  Date: Mon, 6 Feb 2006 20:59:34 -0500
  From: Ravi Chintakunta [EMAIL PROTECTED]
  To: nutch-user@lucene.apache.org
  Subject: [Nutch-general] Dynamic merging of indices
  Reply-To: [EMAIL PROTECTED]
 
  I have multiple indices for the crawls across various intranet sites
  stored in separate folders. My search application should support
  searching across one or more of these indices dynamically - by way of
  checkboxes on the web page.  For this, I have modified NutchBean to
  create the IndexSearcher and FetchedSegments from the segments
  directory (not the merged index directory) in these folders.  Based on
  the selected intranet sites, a NutchBean is instantiated for the
  indices  of the selected sites and the results are displayed.
 
  With this I had the Too many open files error and have increased the
  number of files limit.
 
  This seems to work well now. But if I have 5 such sites, then I am
  opening 2^5 = 32 times more files than I would have opened.
 
  My question is: Is there a better way of doing this? Like:
 
  - Can I open an IndexReader on each of the merged index directory and
  dynamically create an IndexSearcher by merging these readers using
  MultiReader?
 
  - Is an IndexReader thread safe and can it be used simultaneously in
  different IndexSearchers?
 
  - Can I create the IndexReader on the merged index directory and
  create the corresponding FetchedSegments on the corresponding
  non-merged segments directory?
 
  Thanks
  Ravi Chintakunta
 
 
 
 
  
 
 



--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: Single NutchBean and multiple indices support

2006-02-15 Thread Ravi Chintakunta
Hi Jack,

It runs on a local file system.

- Ravi Chintakunta

On 2/15/06, Jack Tang [EMAIL PROTECTED] wrote:
 Hi there.

 I am facing the same the question and looking for same solution.
 Your solution seems easy:) My question is what file system the
 application runs on?
 LocalFileSystem or DistributedFileSystem?

 Thanks
 /Jack

 On 2/9/06, Ravi Chintakunta [EMAIL PROTECTED] wrote:
  Hi David,
 
  Thanks for your reply.
 
  After posting the question, I have done this in a more optimum way.
 
  - I used only a single NutchBean and modified it so that the search
  method takes the indices being searched as an argument. This single
  NutchBean creates separate IndexReaders on the merged indices in the
  directories and keeps them in a map.
 
  - Based on the indexes that are searched, NutchBean creates an
  IndexSearcher using the appropriate IndexReaders. I have added a
  constructor to IndexSearcher that takes an array of IndexReaders and
  uses a MultiReader to initialize itself.
 
  - The NutchBean creates a single FetchedSegments with the combination
  of the segments directories in all the directories.
 
  The advantages with this are:
 
  - A single IndexReader for an index - so no additional filehandles are 
  created.
  - No opening / closing of readers or segments - this improves performance.
 
 
  - Ravi Chintakunta
 
 
   This is almost exactly what I've done.  I create a new NutchBean for
   each search, and point it at whichever of 9 subdirectories the user has
   selected; because I really don't want 511 (2^9-1) beans hanging around.
  
   The reason for the too many open files is that the NutchBean doesn't
   clean up after itself - I guess because for most people, the NutchBean
   is going to be reused.
  
   I added a close() method to FetchSegments.Segment in my installation,
   to close all the readers.  I added a closeSegments() method to
   NutchBean, to call close() on each segment that's been opened.  Then I
   call closeSegments() after each search.
  
   I realise that NutchBean really wasn't designed to support being
   instantiated once per search, but I don't care.  It works well, and
   performance is not an issue.
  
   Regards,
   David.
  
  
   Date: Mon, 6 Feb 2006 20:59:34 -0500
   From: Ravi Chintakunta [EMAIL PROTECTED]
   To: nutch-user@lucene.apache.org
   Subject: [Nutch-general] Dynamic merging of indices
   Reply-To: [EMAIL PROTECTED]
  
   I have multiple indices for the crawls across various intranet sites
   stored in separate folders. My search application should support
   searching across one or more of these indices dynamically - by way of
   checkboxes on the web page.  For this, I have modified NutchBean to
   create the IndexSearcher and FetchedSegments from the segments
   directory (not the merged index directory) in these folders.  Based on
   the selected intranet sites, a NutchBean is instantiated for the
   indices  of the selected sites and the results are displayed.
  
   With this I had the Too many open files error and have increased the
   number of files limit.
  
   This seems to work well now. But if I have 5 such sites, then I am
    opening 2^5 = 32 times more files than I would have opened.
  
   My question is: Is there a better way of doing this? Like:
  
   - Can I open an IndexReader on each of the merged index directory and
   dynamically create an IndexSearcher by merging these readers using
   MultiReader?
  
   - Is an IndexReader thread safe and can it be used simultaneously in
   different IndexSearchers?
  
   - Can I create the IndexReader on the merged index directory and
   create the corresponding FetchedSegments on the corresponding
   non-merged segments directory?
  
   Thanks
   Ravi Chintakunta
  
  
  
  
   
  
  
 


 --
 Keep Discovering ... ...
 http://www.jroller.com/page/jmars



Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Chris Mattmann
Hi,


   the contentTitle will be a concatenation of the titles of the RSS Channels
 that we've parsed.
   So the titles of the RSS Channels are what delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of
the titles of the RSS Channels are what is delivered for the title portion
of indexing.

   If I want the indexer to include more information about a rss file (such
 as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately
that variable includes the item descriptions, along with the channel
descriptions. That, along with the title portion of indexing, is the full
set of textual data delivered by the parser for indexing. So, it already
includes that information. Check out lines 137 and 161 in the parser to see
what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
    contentTitle.toString(), outlinks, content.getMetadata());

parseData.setConf(this.conf);

return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e., the ParseImpl, includes
both the indexText, along with the parse data (that contains the title
text).

Now, if you wanted to add any other metadata gleaned from the RSS to the
title text, or the content text, you can always modify the code to do that
in your own environment. The RSS Parser plugin returns a full channel model
and item model that can be extended and used for those purposes.

Hope that helps!

Cheers,
  Chris


 
 
 在06-2-6,Chris Mattmann [EMAIL PROTECTED] 写道:
 
 Hi there,
 
   That should work: however, the biggest problem will be making sure that
 text/xml is actually the content type of the RSS that you are parsing,
 which you'll have little or no control over.
 
 Check out this previous post of mine on the list to get a better idea of
 what the real issue is:
 
 http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
 G'luck!
 
 Cheers,
 Chris
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 Phone:  818-354-8810
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
 How do I change the plugin.xml? For example, if I want to crawl rss
 files
 end with xml, just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="xml"/>
 
 Am I right?
 
 
 
 在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
 
 Hi there,
 Sure it will, you just have to configure it to do that. Pop over to
 $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there
 there
 is
 an attribute called pathSuffix. Change that to handle whatever type
 of
 rss
 file you want to crawl. That will work locally. For web-based crawls,
 you
 need to make sure that the content type being returned for your RSS
 content
 matches the content type specified in the plugin.xml file that
 parse-rss
 claims to support.
 
 Note that you might not have * a lot * of success with being able to
 control the content type for rss files returned by web servers. I've
 seen
 a
 LOT of inconsistency among the way that they're configured by the
 administrators, etc. However, just to let you know, there are some
 people
 in
 the group that are working on a solution to addressing this.
 
 Hope that helps.
 
 Cheers,
 Chris
 
 
 
 On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
 Hi *Chris,*
 
 The files of RSS 1.0 have a postfix of rdf. So willthe parser
 recognize
 it
 automatically as a rss file?
 
 
 在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
 
 Hi there,
 
 parse-rss is based on commons-feedparser
 (http://jakarta.apache.org/commons/sandbox/feedparser). From the
 feedparser
 website:
 
 ...commons-feedparser supports all versions of RSS (0.9, 0.91,
 0.92,
 1.0,
 and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
 extension
 and RSS 1.0 modules capability...
 
 Hope that helps

Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Elwin
According to the code:
theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
I can see that the item description is also included.

However, when I tried with this feed:
http://kgrimm.bravejournal.com/feed.rss
I could only get the title and description for the channel, and failed to find
the words from the item descriptions when searching.

From the above code, the item description is combined with the outlink URL; is
it used as the contentTitle for that URL? When the outlink is fetched and
parsed, I think new data about that URL will be generated.


On 2006-2-11, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi,


the contentTitle will be a concatenation of the titles of the RSS
 Channels
  that we've parsed.
So the titles of the RSS Channels are what delivered for indexing,
 right?

 They're certainly part of it, but not the only part. The concatenation of
 the titles of the RSS Channels are what is delivered for the title
 portion
 of indexing.

If I want the indexer to include more information about a rss file
 (such
  as item descriptions), can I just concatenate them to the contentTitle?

 They're already there. There is a variable called index text: ultimately
 that variable includes the item descriptions, along with the channel
 descriptions. That, along with the title portion of indexing is the full
 set of textual data delivered by the parser for indexing. So, it already
 includes that information. Check out lines 137, and 161 in the parser to
 see
 what I mean. Also, check out lines 204-207, which are:

 ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
     contentTitle.toString(), outlinks, content.getMetadata());

 parseData.setConf(this.conf);

 return new ParseImpl(indexText.toString(), parseData);

 You can see that the return from the Parser, i.e., the ParseImpl, includes
 both the indexText, along with the parse data (that contains the title
 text).

 Now, if you wanted to add any other metadata gleaned from the RSS to the
 title text, or the content text, you can always modify the code to do that
 in your own environment. The RSS Parser plugin returns a full channel
 model
 and item model that can be extended and used for those purposes.

 Hope that helps!

 Cheers,
 Chris


 
 
  在06-2-6,Chris Mattmann [EMAIL PROTECTED] 写道:
 
  Hi there,
 
That should work: however, the biggest problem will be making sure
 that
  text/xml is actually the content type of the RSS that you are
 parsing,
  which you'll have little or no control over.
 
  Check out this previous post of mine on the list to get a better idea
 of
  what the real issue is:
 
  http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
  G'luck!
 
  Cheers,
  Chris
 
 
  __
  Chris A. Mattmann
  [EMAIL PROTECTED]
  Staff Member
  Modeling and Data Management Systems Section (387)
  Data Management Systems and Technologies Group
 
  _
  Jet Propulsion LaboratoryPasadena, CA
  Office: 171-266BMailstop:  171-246
  Phone:  818-354-8810
  ___
 
  Disclaimer:  The opinions presented within are my own and do not
 reflect
  those of either NASA, JPL, or the California Institute of Technology.
 
  -Original Message-
  From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
  Sent: Saturday, February 04, 2006 11:40 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: Which version of rss does parse-rss plugin support?
 
  Hi Chris
 
 
  How do I change the plugin.xml? For example, if I want to crawl rss
  files
  end with xml, just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="xml"/>
 
  Am I right?
 
 
 
  在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there
  there
  is
  an attribute called pathSuffix. Change that to handle whatever type
  of
  rss
  file you want to crawl. That will work locally. For web-based crawls,
  you
  need to make sure that the content type being returned for your RSS
  content
  matches the content type specified in the plugin.xml file that
  parse-rss
  claims to support.
 
  Note that you might not have * a lot * of success with being able to
  control the content type for rss files returned by web servers. I've
  seen
  a
  LOT of inconsistency among the way that they're configured by the
  administrators, etc. However, just

opensearch support

2006-02-07 Thread Geraint Williams
Is OpenSearch being developed?

I am using Nutch 0.7 and it seems to have some OpenSearch support.

However, I failed to get either a Python or a Perl OpenSearch client
library working (admittedly these are also in early development).  The Perl
library seemed to choke on not finding the OpenSearchDescription; I
didn't have enough time to investigate.

I can, of course, just post the query and parse the XML search results manually.

Thanks,
Geraint


RE: Which version of rss does parse-rss plugin support?

2006-02-05 Thread Chris Mattmann
Hi there,

   That should work: however, the biggest problem will be making sure that
text/xml is actually the content type of the RSS that you are parsing,
which you'll have little or no control over. 

Check out this previous post of mine on the list to get a better idea of
what the real issue is:

http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
 How do I change the plugin.xml? For example, if I want to crawl rss files
 end with xml, just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="xml"/>
 
 Am I right?
 
 
 
 在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there
  is
  an attribute called pathSuffix. Change that to handle whatever type of
  rss
  file you want to crawl. That will work locally. For web-based crawls,
 you
  need to make sure that the content type being returned for your RSS
  content
  matches the content type specified in the plugin.xml file that parse-rss
  claims to support.
 
  Note that you might not have * a lot * of success with being able to
  control the content type for rss files returned by web servers. I've
 seen
  a
  LOT of inconsistency among the way that they're configured by the
  administrators, etc. However, just to let you know, there are some
 people
  in
  the group that are working on a solution to addressing this.
 
  Hope that helps.
 
  Cheers,
  Chris
 
 
 
  On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
   Hi *Chris,*
  
   The files of RSS 1.0 have a postfix of rdf. So willthe parser
 recognize
  it
   automatically as a rss file?
  
  
   在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
  
   Hi there,
  
   parse-rss is based on commons-feedparser
   (http://jakarta.apache.org/commons/sandbox/feedparser). From the
   feedparser
   website:
  
   ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92,
  1.0,
   and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
  extension
   and RSS 1.0 modules capability...
  
   Hope that helps.
  
   Thanks,
   Chris
  
  
   On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
  
   I see the test file is of version 0.91.
   Does the plugin support higher versions like 1.0 or 2.0?
  
   --
   《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周
星驰岂是池中物,喜剧天
  分 既
   然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一展风采。无线既
得千里马,又失千里马,
  当 然
   后悔莫及。
  
  
  
  
  
   --
   《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周星
驰岂是池中物,喜剧天分既
   然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一展风采。无线既得
千里马,又失千里马,当然
   后悔莫及。
 
 
 
 
 
 --
 《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周星驰岂
是池中物,喜剧天分既然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一
 展风采。无线既得千里马,又失千里马,当然后悔莫及。



Re: Which version of rss does parse-rss plugin support?

2006-02-05 Thread 盖世豪侠
Hi Chris,

 Thank you for your post and I've read it through.
  So, you mean I should also add these lines to the plugin.xml in most
cases:

   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="rss"/>
 ...
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="text/xml"
       pathSuffix="xml"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="text/xml"
       pathSuffix="rss"/>

On 2006-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi there,

   That should work: however, the biggest problem will be making sure that
 text/xml is actually the content type of the RSS that you are parsing,
 which you'll have little or no control over.

 Check out this previous post of mine on the list to get a better idea of
 what the real issue is:

 http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

 G'luck!

 Cheers,
 Chris


 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group

 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 Phone:  818-354-8810
 ___

 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.

  -Original Message-
  From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
  Sent: Saturday, February 04, 2006 11:40 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: Which version of rss does parse-rss plugin support?
 
  Hi Chris
 
 
  How do I change the plugin.xml? For example, if I want to crawl rss
 files
  end with xml, just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
       class="org.apache.nutch.parse.rss.RSSParser"
       contentType="application/rss+xml"
       pathSuffix="xml"/>
 
  Am I right?
 
 
 
  在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
  
   Hi there,
   Sure it will, you just have to configure it to do that. Pop over to
   $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there
 there
   is
   an attribute called pathSuffix. Change that to handle whatever type
 of
   rss
   file you want to crawl. That will work locally. For web-based crawls,
  you
   need to make sure that the content type being returned for your RSS
   content
   matches the content type specified in the plugin.xml file that
 parse-rss
   claims to support.
  
   Note that you might not have * a lot * of success with being able to
   control the content type for rss files returned by web servers. I've
  seen
   a
   LOT of inconsistency among the way that they're configured by the
   administrators, etc. However, just to let you know, there are some
  people
   in
   the group that are working on a solution to addressing this.
  
   Hope that helps.
  
   Cheers,
   Chris
  
  
  
   On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
  
Hi *Chris,*
   
The files of RSS 1.0 have a postfix of rdf. So willthe parser
  recognize
   it
automatically as a rss file?
   
   
在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
   
Hi there,
   
parse-rss is based on commons-feedparser
(http://jakarta.apache.org/commons/sandbox/feedparser). From the
feedparser
website:
   
...commons-feedparser supports all versions of RSS (0.9, 0.91,
 0.92,
   1.0,
and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
   extension
and RSS 1.0 modules capability...
   
Hope that helps.
   
Thanks,
Chris
   
   
On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
   
I see the test file is of version 0.91.
Does the plugin support higher versions like 1.0 or 2.0?
   
--
《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周
 星驰岂是池中物,喜剧天
   分 既
然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一展风采。无线既
 得千里马,又失千里马,
   当 然
后悔莫及。
   
   
   
   
   
--
《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周星
 驰岂是池中物,喜剧天分既
然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一展风采。无线既得
 千里马,又失千里马,当然
后悔莫及。
  
  
  
 
 
  --
  《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周星驰岂
 是池中物,喜剧天分既然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一
  展风采。无线既得千里马,又失千里马,当然后悔莫及。




--
《盖世豪侠》好评如潮,让无线收视居高不下,无线高兴之余,仍未重用。周星驰岂是池中物,喜剧天分既然崭露,当然不甘心受冷落,于是转投电影界,在大银幕上一展风采。无线既得千里马,又失千里马,当然

Re: Which version of rss does parse-rss plugin support?

2006-02-04 Thread 盖世豪侠
Hi Chris


How do I change the plugin.xml? For example, if I want to crawl rss files
ending with xml, do I just add a new element?

  <implementation id="org.apache.nutch.parse.rss.RSSParser"
      class="org.apache.nutch.parse.rss.RSSParser"
      contentType="application/rss+xml"
      pathSuffix="rss"/>
  <implementation id="org.apache.nutch.parse.rss.RSSParser"
      class="org.apache.nutch.parse.rss.RSSParser"
      contentType="application/rss+xml"
      pathSuffix="xml"/>

Am I right?



On 2006-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi there,
 Sure it will, you just have to configure it to do that. Pop over to
 $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there
 is
 an attribute called pathSuffix. Change that to handle whatever type of
 rss
 file you want to crawl. That will work locally. For web-based crawls, you
 need to make sure that the content type being returned for your RSS
 content
 matches the content type specified in the plugin.xml file that parse-rss
 claims to support.

 Note that you might not have * a lot * of success with being able to
 control the content type for rss files returned by web servers. I've seen
 a
 LOT of inconsistency among the way that they're configured by the
 administrators, etc. However, just to let you know, there are some people
 in
 the group that are working on a solution to addressing this.

 Hope that helps.

 Cheers,
 Chris



 On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

  Hi *Chris,*
 
  The files of RSS 1.0 have a postfix of rdf. So willthe parser recognize
 it
  automatically as a rss file?
 
 
  在06-2-3,Chris Mattmann [EMAIL PROTECTED] 写道:
 
  Hi there,
 
  parse-rss is based on commons-feedparser
  (http://jakarta.apache.org/commons/sandbox/feedparser). From the
  feedparser
  website:
 
  ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92,
 1.0,
  and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
 extension
  and RSS 1.0 modules capability...
 
  Hope that helps.
 
  Thanks,
  Chris
 
 
  On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
  I see the test file is of version 0.91.
  Does the plugin support higher versions like 1.0 or 2.0?
 


Which version of rss does parse-rss plugin support?

2006-02-03 Thread 盖世豪侠
I see the test file is of version 0.91.
Does the plugin support higher versions like 1.0 or 2.0?



Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread Chris Mattmann
Hi there,

  parse-rss is based on commons-feedparser
(http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser
website:

...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
and RSS 1.0 modules capability...

Hope that helps.

Thanks,
  Chris


On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

 I see the test file is of version 0.91.
 Does the plugin support higher versions like 1.0 or 2.0?
 




Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread 盖世豪侠
Hi *Chris,*

The files of RSS 1.0 have a suffix of .rdf. So will the parser recognize it
automatically as an RSS file?


On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi there,

 parse-rss is based on commons-feedparser
 (http://jakarta.apache.org/commons/sandbox/feedparser). From the
 feedparser
 website:

 ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
 and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
 and RSS 1.0 modules capability...

 Hope that helps.

 Thanks,
 Chris


 On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

  I see the test file is of version 0.91.
  Does the plugin support higher versions like 1.0 or 2.0?
 


Multi CPU support

2006-01-09 Thread Teruhiko Kurosaka
Can I use MapReduce to run Nutch on a multi CPU system?

I want to run the index job on two (or four) CPUs
on a single system.  I'm not trying to distribute the job
over multiple systems.

If the MapReduce is the way to go,
do I just specify config parameters like these:
mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)

and
bin/start-all.sh

?

Must I use NDFS for MapReduce?

Do I need to do anything else to make sure that
the two processes run on different CPUs?

Is this the only way to take advantage of a multi CPU system?

-kuro


Re: Multi CPU support

2006-01-09 Thread Doug Cutting

Teruhiko Kurosaka wrote:

Can I use MapReduce to run Nutch on a multi CPU system?


Yes.


I want to run the index job on two (or four) CPUs
on a single system.  I'm not trying to distribute the job
over multiple systems.

If the MapReduce is the way to go,
do I just specify config parameters like these:
mapred.tasktracker.tasks.maximum=2
mapred.job.tracker=localhost:9001
mapred.reduce.tasks=2 (or 1?)

and
bin/start-all.sh

?


That should work.  You'd probably want to set the default number of map 
tasks to be a multiple of the number of CPUs, and the number of reduce 
tasks to be exactly the number of CPUs.


Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker


Must I use NDFS for MapReduce?


No.

Doug
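A minimal sketch of how the properties discussed above might look for a
two-CPU box, using the names quoted in this thread; which file they belong in
(nutch-site.xml or the mapred config of your Nutch version) and the exact
defaults vary, so treat the values as illustrative only:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>  <!-- the job tracker started by bin/nutch-daemon.sh -->
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>               <!-- at most two concurrent tasks on this tasktracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>               <!-- a small multiple of the CPU count, per Doug's note -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>               <!-- exactly the number of CPUs -->
</property>

With those set, start the two daemons as shown above and run the index job as
usual.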


multibyte character support status

2005-12-27 Thread Teruhiko Kurosaka
What is the current state and plan for multibyte
character support by Nutch?

As far as I can tell...

The PDF plugin uses PDFBox (www.pdfbox.org) which does not
work with Japanese and probably other multibyte characters
and code sets.

The Word plugin uses POI (http://jakarta.apache.org/poi/),
which doesn't seem to support Japanese. Some patches to
make it possible to support Japanese (and hopefully other
code sets) have been submitted to the POI project but
they have not been integrated because the project currently
has no committer.

RTF document plugin and PowerPoint plugin use home-grown
parsers.  What is the status of multibyte code set
(and single byte code set other than ISO-8859-1) support by
these plugins?

-Kuro


Re: PDF indexing support?

2005-11-16 Thread Håvard W. Kongsgård

Thanks, it worked.


Jérôme Charron wrote:


The value you specified is bigger than the maximum int value, so it throws an
exception and the default value is used.
As mentioned in the property's description, use a negative value (-1) for
no truncation at all (or a value less than java.lang.Integer.MAX_VALUE).

Regards

Jérôme
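A corrected conf/nutch-site.xml along these lines might look like the
following sketch (not taken from the thread); -1 disables truncation
entirely, and any non-negative value below java.lang.Integer.MAX_VALUE would
also be accepted:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Negative value: do not truncate downloaded content at all.</description>
</property>
</nutch-conf>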

On 11/16/05, Håvard W. Kongsgård [EMAIL PROTECTED] wrote:
 


I have now added conf/nutch-site.xml but still get the same problem. Related
to the problem? http://sourceforge.net/forum/message.php?msg_id=3391668
http://sourceforge.net/forum/message.php?msg_id=3398773

   


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<nutch-conf>
<property>
  <name>http.content.limit</name>
  <value>45451515565536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
</nutch-conf>

Håvard W. Kongsgård wrote:

 


HTTP


Sébastien LE CALLONNEC wrote:

   


Hej Håvard,

That's because you have to create one yourself. The values you will
set in there will override the default values.

Here are a few more questions to try to solve your problem: where is
your PDF located? What protocol is used to fetch it (HTTP, FTP, etc.)?


Regards,
/sebastien

--- Håvard W. Kongsgård [EMAIL PROTECTED] wrote:



 


Don't have a conf/nutch-site.xml



Jérôme Charron wrote:



   


conf/nutch-default


   


Check that they are not overridden in conf/nutch-site.
If not, sorry, no more ideas for now :-(

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
 





Re: PDF indexing support?

2005-11-16 Thread Hasan Diwan


On Nov 15, 2005, at 2:46 PM, Håvard W. Kongsgård wrote:


Don't have a conf/nutch-site.xml


Create it and put the overrides in there, per the nutch tutorial.

Cheers,
Hasan Diwan [EMAIL PROTECTED]





Re: PDF indexing support?

2005-11-15 Thread Stefan Groschupf

Verify that you have the very latest PDFBox from their website.
A lot of people have noticed that PDFBox is a little bit buggy.

Stefan

On 15.11.2005 at 22:27, Håvard W. Kongsgård wrote:


Nutch won't index some of my PDF files. I get this error:
reason: failed(2,202): Content truncated at 66608 bytes. Parser
can't handle incomplete pdf file.


Is there a bug in the pdf plugin (PDFBox)? I am using Nutch 0.7.1.
I know from experience that some PDF-to-text programs like xpdf
have some problems with PDF v1.6 (Adobe Acrobat 7/CS).


Jérôme Charron wrote:


Hello, I'm new with Nutch. How do I enable PDF indexing support?



Simply by activating the parse-pdf plugin in nutch-default.xml or
nutch-site.xml
(take a look at the plugin.includes property)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/









---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård

conf/nutch-default



Jérôme Charron wrote:


http.content.limit=542256565536 and file.content.limit=4541165536
still the same error:
   



Where do you specify these values? In nutch-default or nutch-site?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 









Re: PDF indexing support?

2005-11-15 Thread Jérôme Charron
 conf/nutch-default

Check that they are not overridden in conf/nutch-site.
If not, sorry, no more ideas for now :-(

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: PDF indexing support?

2005-11-15 Thread Håvard W. Kongsgård

Don't have a conf/nutch-site.xml



Jérôme Charron wrote:


conf/nutch-default
   



Check that they are not overridden in conf/nutch-site.
If not, sorry, no more ideas for now :-(

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

 









PDF indexing support?

2005-11-14 Thread Håvard W. Kongsgård

Hello, I'm new with Nutch. How do I enable PDF indexing support?


PDF support? Does crawl parse p

2005-08-31 Thread Diane Palla
Does Nutch have a way to parse pdf files, that is, application/pdf 
content type files?

I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make it parse PDF files?

When I do the crawl, I get this error with application/pdf files:

050831 145126 fetch okay, but can't parse 
mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): 
Content-Type not text/html: application/pdf


If it's not possible, in what future version of Nutch do the developers expect
to support application/pdf content types and have such parsing of PDF files
available?


Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]




Bryan Woliner [EMAIL PROTECTED] 
08/23/2005 05:22 PM
Please respond to
nutch-user@lucene.apache.org


To
nutch-user@lucene.apache.org
cc

Subject
Adding small batches of fetched URLs to a larger aggregate segment/index






Hi,

I have a number of sites that I want to crawl, then merge their segments and
create a single index. One of the main reasons I want to do this is that I
want some of the sites in my index to be crawled on a daily basis, others on
a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched
URLs to a single aggregate segment/index. I have a couple of questions about
doing this:

1. Is it possible to use a different regex.urlfilter.txt file for each site
that I am crawling? If so, how would I do this?

2. If I have a very large segment that is indexed (my aggregate index) and I
want to add another (much smaller) set of fetched URLs to this index, what
is the best way to do this? It seems like merging the small and large
segments and then re-indexing the whole thing would be very time consuming
-- especially if I wanted to add new small sets of fetched URLs frequently.


Thanks for any suggestions you have to offer,
Bryan



Re: PDF support? Does crawl parse p

2005-08-31 Thread Piotr Kosiorowski

Hello Diane,
There is a plugin to parse PDF files. You have to enable it in
nutch-site.xml (just copy the entry from nutch-default.xml).

You have to change the plugin.includes property to include the parse-pdf plugin:
[...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...]
Regards
Piotr
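A minimal sketch of the corresponding nutch-site.xml override; the value
below only illustrates the shape, so copy the real default plugin.includes
from your nutch-default.xml and just add |pdf inside the parse-(...) group:

<property>
  <name>plugin.includes</name>
  <!-- copied from nutch-default.xml with pdf added to the parse group -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>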

Diane Palla wrote:
Does Nutch have a way to parse pdf files, that is, application/pdf 
content type files?


I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make it parse PDF files?

When I do the crawl, I get this error with application/pdf files:

050831 145126 fetch okay, but can't parse 
mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): 
Content-Type not text/html: application/pdf



If it's not possible, in what future version of Nutch do the developers expect
to support application/pdf content types and have such parsing of PDF files
available?



Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]




Bryan Woliner [EMAIL PROTECTED] 
08/23/2005 05:22 PM

Please respond to
nutch-user@lucene.apache.org


To
nutch-user@lucene.apache.org
cc

Subject
Adding small batches of fetched URLs to a larger aggregate segment/index






Hi,

I have a number of sites that I want to crawl, then merge their segments and
create a single index. One of the main reasons I want to do this is that I
want some of the sites in my index to be crawled on a daily basis, others on
a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched
URLs to a single aggregate segment/index. I have a couple of questions about
doing this:

1. Is it possible to use a different regex.urlfilter.txt file for each site
that I am crawling? If so, how would I do this?

2. If I have a very large segment that is indexed (my aggregate index) and I
want to add another (much smaller) set of fetched URLs to this index, what
is the best way to do this? It seems like merging the small and large
segments and then re-indexing the whole thing would be very time consuming
-- especially if I wanted to add new small sets of fetched URLs frequently.



Thanks for any suggestions you have to offer,
Bryan






Re: metadata support in WebDB (Stefan's NUTCH-59 patch)

2005-07-19 Thread Stefan Groschupf

Hi Otis,


http://issues.apache.org/jira/browse/NUTCH-59

This patch looks interesting for my Nutch needs,

So please vote for the patch if you like it. :-)


I can't look at the code, but looking at your diff, it looks like this
metadata would be stored somewhere inside Nutch's WebDB, and that one
would have to provide this metadata to Nutch during URL injection.
Is this correct?

Yes, metadata are part of the page object and stored in the WebDB.
You can add metadata in any situation where you maintain this page object.
So you can have a custom injector, as you describe, to set metadata,
but more interestingly you can also set them at fetch time, or in any
situation where you have access to the page object (e.g. segment
generation, db update, etc.).




I currently have this little wrapper method around a few of Nutch's
tool classes (below).  I first generate a plain-text file with all URLs
I want to fetch, then I call the method below, and then I just call
Fetcher.main(...).  If I want to associate some metadata with each URL
to be fetched, where would I insert it into the system?  Would I need
my own injector class with my own addPage method that pulls metadata in
(from some external storage) for each URL it gets, and call
dbWriter.addPageIfNotPresent(page) like WebDBInjector does with DMOZ
data?


Yes.
I would personally suggest creating an extension point for the injector.
Other people may find that interesting as well, and you could contribute
this extension point. :)
Write a small plugin that looks up the metadata you plan to add from a
MySQL DB or so and adds them to the page object. That's it.
You can do very interesting things during the life cycle of the page
object, for example generating metadata from HTML content or at fetch time
(more intelligent fetching), or handing metadata over from pages to links, etc.


Keep in mind that you can use storage types other than the currently
existing map for storing metadata. For example, you could implement a
StringArray or other data types.




In general I would love to see more interest in this patch and some
votes, since I think such metadata can bring a lot of new possible
features to Nutch.
The very interesting part is that if you do not use metadata, the WebDB
is not blown up, and the patch does not slow down WebDB processing speed.


Greetings,
Stefan