Re: End-Of-Life status for 0.7.x?

2008-01-18 Thread Jérôme Charron
+1

On Jan 18, 2008 5:22 AM, Sami Siren [EMAIL PROTECTED] wrote:

 Andrzej Bialecki wrote:
  Hi all,
  My opinion is that we should mark it EOL, and close all JIRA issues that
  are relevant only to 0.7.x, with the status Won't Fix.
 

 +1

 --
  Sami Siren




-- 
Jérôme Charron
Technical Director @ WebPulse
Tel: +33673716743 - [EMAIL PROTECTED]
http://blog.shopreflex.com/
All tastes are found in nature; yours are at
http://www.shopreflex.com


Re: log guards

2007-02-13 Thread Jérôme Charron

These guards were all introduced by a patch some time ago.  I complained
at the time and it was promised that this would be repaired, but it has
not yet been.


Yes, sorry Doug, that's my own fault.
I really don't have time to fix this :-(

Best regards

Jérôme


Re: log guards

2007-02-13 Thread Jérôme Charron

Hi Chris,

The JIRA issue is NUTCH-309: https://issues.apache.org/jira/browse/NUTCH-309
Thanks for your help.

Jérôme

On 2/13/07, Chris Mattmann [EMAIL PROTECTED] wrote:


Hi Doug, and Jerome,

  Ah, yes, the log guard conversation. I remember this from a while back.
Hmmm, do you guys know which issue this is recorded as in JIRA? I have some
free time at the moment, so I will be able to add this to my list of Nutch
stuff to work on, and would be happy to take the lead on removing the
guards where needed, and reviewing whether or not the debug ones make sense
where they are.

Cheers,
  Chris



On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote:

 These guards were all introduced by a patch some time ago.  I
complained
 at the time and it was promised that this would be repaired, but it has
 not yet been.

 Yes, sorry Doug, that's my own fault.
 I really don't have time to fix this :-(

 Best regards

 Jérôme

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.





Re: implement thai language indexing and search

2006-11-28 Thread Jérôme Charron

I used an existing ThaiAnalyzer which was in the lucene package.
OK - I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled and
placed all the class files in a jar - analysis-th.jar (do I need to bundle
the ngp file in the jar as well?)


1. You don't have to refactor the lucene analyzer. Just wrap it like I do
with the french and german analyzers (they both use analyzers from lucene).
2. The analyzer doesn't need ngp files... I think you misunderstood
something:
2.1 On one side, there is the language identifier, which uses NGP files to
identify the language of a document.
2.2 On the other side, if a suitable analyzer is found for the identified
language, it is used to analyze the document.
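
For illustration, a minimal sketch of such a wrapper, modeled on the french
one (the class and package names here are hypothetical; the stock Lucene
ThaiAnalyzer does the real work):

  package org.apache.nutch.analysis.th;

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.nutch.analysis.NutchAnalyzer;

  public class ThaiAnalyzer extends NutchAnalyzer {

    // delegate to the existing Lucene analyzer instead of copying its code
    private final static Analyzer ANALYZER =
        new org.apache.lucene.analysis.th.ThaiAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
      return ANALYZER.tokenStream(fieldName, reader);
    }
  }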

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


Re: implement thai language indexing and search

2006-11-16 Thread Jérôme Charron

 ok. I was able to enable the language identifier plugin by adding the
 value to the plugin.includes attribute in nutch-site.xml - but I'm not
 sure that just by doing that I can have thai text recognized and
 tokenized properly.
 What else do I have to do? Please help me.


1. You must create a thai NGP (NGram Profile file) so that the language
identifier can identify thai!
2. You must create a thai analyzer (see for instance the analysis-fr and
analysis-de sample analyzers; a plugin.xml sketch follows below).
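
For instance, the plugin.xml of such an analysis-th plugin could declare
(ids hypothetical, following the same format as the other analysis plugins):

  <extension id="org.apache.nutch.analysis.th"
             name="ThaiAnalyzer"
             point="org.apache.nutch.analysis.NutchAnalyzer">
    <implementation id="ThaiAnalyzer"
                    class="org.apache.nutch.analysis.th.ThaiAnalyzer">
      <parameter name="lang" value="th"/>
    </implementation>
  </extension>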

Best Regards

Jérôme


Re: Content-type detection for Tika

2006-09-06 Thread Jérôme Charron



I'm thinking about implementing the (draft) shared MIME database spec
[1] from freedesktop.org in Tika as a modern MIME magic implementation
to help automatically detect and handle the types of resources where
insufficient typing metadata is available. The specified typing
information also includes an inheritance model which allows for
automatic failover to more generic parsers (e.g. from image/svg to
text/xml) when specific parser plugins are not available.


I already have such code for Nutch (freedesktop-based content-type
detection).
These days I have no more time to spend on Nutch, but I can send you the
code.
Please contact me at my private mail address.

Regards

Jérôme


Re: Antwort: Re: parse-plugins.xml

2006-08-04 Thread Jérôme Charron

What you probably mean is something equivalent to Unix strings(1). I
have a plugin that implements this, which I could contribute if there's
interest.


+1

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Error with Hadoop-0.4.0

2006-07-07 Thread Jérôme Charron

I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.


Thanks for this feedback Stefan.



We should fix that.


What I suggest is simply to remove line 75 in the createJob method of
CrawlDb:
setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
In fact, this method is only used by Injector.inject() and CrawlDb.update(),
and the input path set in createJob is needed by neither Injector.inject()
nor CrawlDb.update().

If no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Error with Hadoop-0.4.0

2006-07-06 Thread Jérôme Charron

Hi,

I encountered some problems with the Nutch trunk version.
In fact it seems to be related to the changes for Hadoop-0.4.0 and JDK 1.5
(more precisely, since HADOOP-129 and the replacement of File by Path).

In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273))
- Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory
/localpath/crawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of the Injector
to mergeJob.setInputPath(tempDir) (instead of
mergeJob.addInputPath(tempDir)),
all is working fine. By taking a closer look at the CrawlDb code, I finally
don't understand why the following line is in the createJob method:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

Out of curiosity, if a hadoop guru can explain why there is such a
regression...

Does somebody have the same error?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Possible memory leak?

2006-06-28 Thread Jérôme Charron

It seems to be a side effect of NUTCH-169 (remove static NutchConf).
Prior to this, the language identifier was a singleton.
I think we should cache its instance in the conf, as we do for many other
objects in Nutch.
Enrico, could you please create a JIRA issue?
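
For reference, a minimal sketch of that caching pattern (assuming the
getObject/setObject pair that Configuration exposed at the time):

  public static LanguageIdentifier getInstance(Configuration conf) {
    String key = LanguageIdentifier.class.getName();
    // reuse the instance already cached in this Configuration, if any
    LanguageIdentifier identifier = (LanguageIdentifier) conf.getObject(key);
    if (identifier == null) {
      identifier = new LanguageIdentifier(conf);
      conf.setObject(key, identifier);
    }
    return identifier;
  }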

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: <noindex> do not index/noindex

2006-06-22 Thread Jérôme Charron

as far as I can see, nutch's html parser only supports the robots meta tag
(<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there is an
unofficial html <noindex> tag.
http://www.webmasterworld.com/forum10003/2703.htm


Hello Stefan,

Here is a previous discussion about this :
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg04576.html



Maybe this would be another thing to make nutch more polite.


+1

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach

2006-06-22 Thread Jérôme Charron

I don't think guards should be added everywhere.


That's right, Doug.
It was a crude first pass on logging.
The next (finer) pass will be done with NUTCH-310.



Rather, guards should only be added
in performance critical code, and then only for Debug-level output.
Info and Warn levels are normally enabled, and developers should
thus not log messages at these levels so frequently that performance
will be compromised.


Yes, but that's actually not the case in Nutch: most of the logging
statements use the Info level.



  And not all Debug-level log statements need
guards, only those that are in inner loops, where the construction of
the log message may significantly affect performance.


I plan to review all the logging statements while working on NUTCH-310,
and I will then follow your directions.
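
For the record, the guarded pattern to keep in those hot paths (Commons
Logging; the variables in the message are just placeholders):

  if (LOG.isDebugEnabled()) {
    // the string concatenation is skipped entirely when debug is off
    LOG.debug("fetching " + url + " (" + outlinkCount + " outlinks)");
  }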

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Nutch-cvs] svn commit: r414681 - /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolFactory.java

2006-06-16 Thread Jérôme Charron

I'm somewhat worried about the possible clash in the conf name-space -
usually, when we store Objects in a Configuration instance, we use their
full class name, or at least a long and most probably unique string. In
this case, we use just http, https, ftp, file and so on ...
Would it make sense if, in this special case, we used X_POINT +
protocolName as the unique string?


+1
(Why not use the extension id directly?)
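
A sketch of the keying Andrzej suggests (assuming the usual object cache
in the Configuration; Protocol.X_POINT_ID is the extension-point id
string):

  String key = Protocol.X_POINT_ID + protocolName;   // "...Protocol" + "http"
  conf.setObject(key, protocol);                     // cache the instance
  Protocol cached = (Protocol) conf.getObject(key);  // later lookups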


--
http://motrech.free.fr/
http://www.frutch.org/


Re: [jira] Resolved: (NUTCH-303) logging improvements

2006-06-13 Thread Jérôme Charron

There seem to be two log4j.properties files in the generated war; is this
intentional?


Not intentional - a side effect.
In fact, the first one is the one that comes from the conf dir (I will
exclude it from the war so that it will be clearer).
The second one (which overrides the first) is the correct one, coming from
the web directory.

However, it works just fine.


That's good news.
Sami, I have not made changes to web2. Do you want me to switch web2 to
Commons Logging?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Nutch logging questions

2006-06-09 Thread Jérôme Charron

Hi,

I'm currently working on NUTCH-303 so that nutch uses the commons logging
facade API with log4j as the default implementation. All the code is now
switched to the Commons Logging API, and I have replaced some System.out
and printStackTrace calls with Commons Logging.

To finalize this patch, my problem is the configuration:

1. Should the back-end and the front-end have the same logging
configuration?
2. What kind of configuration do you think is the best default?
For now, I have used the same log4j properties as hadoop (see
http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254
) for the back-end, and
I was thinking of using stdout for the front-end.
What do you think about this?
3. When using the default DRFA appender (Daily Rolling File Appender) in
nutch, should I log to the hadoop log file or to a nutch-specific file?
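
For context, the kind of DRFA configuration I mean (the property names are
standard log4j; the log dir/file variables are illustrative):

  log4j.rootLogger=INFO,DRFA
  log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
  # roll the file once per day
  log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
  log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
  log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n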

Thanks for your feedback.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Status of language plugin

2006-06-07 Thread Jérôme Charron

Is there an API doc or design doc that I can read to
understand where you are? Is the language plugin architecture
already in the main trunk?


The only available document is
http://wiki.apache.org/nutch/MultiLingualSupport
and sometimes I maintain this page
http://wiki.apache.org/nutch/JeromeCharron



Here are some issues that I've been worried about:
* Support for multilingual plugins?
** If one plugin can support more than one language,
   the language needs to be passed at each analysis.

I don't understand your need.
But if you have an analysis plugin that can handle many languages, you can
simply define many implementations in your plugin.xml, e.g.:

<extension id="org.apache.nutch.analysis.cjk"
           name="CJKAnalyzer"
           point="org.apache.nutch.analysis.NutchAnalyzer">

  <implementation id="org.apache.nutch.analysis.cn.ChineseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="cn"/>
  </implementation>

  <implementation id="org.apache.nutch.analysis.kr.KoreanAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="kr"/>
  </implementation>

  <implementation id="org.apache.nutch.analysis.jp.JapaneseAnalyzer"
                  class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
    <parameter name="lang" value="jp"/>
  </implementation>

</extension>



** This assumes language identification is done before
   analysis.  Is it the case ?


Yes.



* Support of a different analyzer for query than for index
** The analyzer for query may need to behave differently than the
   analyzer for indexing.  Can your architecture
   specify different analyzers for indexing and query?


In fact, to avoid adding a QueryAnalyzer extension point,
the query uses the same Analyzer implementation as the one used
for document analysis.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Jérôme Charron

URL: http://svn.apache.org/viewvc?rev=411943&view=rev
Log:
Updating to Hadoop release 0.3.1.  Hadoop now uses Jakarta Commons
Logging, configured for log4j by default.


If log4j is now included in the core, we can remove the lib-log4j plugin.
If there is no objection, I will do it.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Jérôme Charron

As far as I understand, hadoop uses commons logging. Should we switch to
using commons logging as well?


Why not...
(but using commons logging doesn't preclude having a default
implementation, such as log4j as used by hadoop).


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Jérôme Charron

You're right -- changing anything with the input (snippets length,
number of documents etc) will alter the clusters. This is basically how
it works. If you want clustering in your search engine then, depending
on the type of data you serve, you'll have to experiment with the
settings a bit and see which give you satisfactory results. I don't
think there is any particular reason to provide different data to the
clusterer. Moreover, it'd complicate things quite badly.


Thanks Dawid for your response.
In fact, I don't really want to change this; I just want to be sure that
everybody is aware of it, and to gather some opinions.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

Add 3. Clustering would benefit from a plain text version.


Yes Dawid, but it is already committed => the clustering now uses the plain
text version returned by the toString() method.

Dawid, I have a question about clustering.
Currently, the clustering uses the summaries as input. I assume it would
give better results if it took the whole document content, no?
I assume that clustering uses the summaries instead of the document content
for performance reasons.
But there is a (bad) side effect: since the size of the summaries is
configurable, the clustering quality will vary depending on the summary
size configuration. I find this very confusing: when folks adjust this
parameter it is only for front-end reasons (they want to display a long or
a short summary), but certainly not for clustering reasons.

What do you and others think about this?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

 (but if nutch-site.xml overrides the plugin.includes property and
 doesn't include it, it will not be activated, like any other plugin)
yes, that's what I meant; I guess that's the default case for people
hacking plugins.


Oh, yes Sami, I understand what you mean...
Sorry, I just forgot to mention this point on the list (so, plugin
hackers: you need to add one of the new summary plugins if you want
summaries displayed).
Sorry, I also forgot to add the summary plugins to the default webapp
context file (nutch.xml)... I will add this once svn write access is
available again.
And one more time sorry, because I also forgot to port the summary API
changes to the web2 module...

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron

Bob Carpenter of alias-i had this to say when I brought up this very
idea:
http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599


Thanks for your response Marvin.
But finally my question is: shouldn't the nutch clustering use fixed-size
snippets instead of the configurable display size?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Jérôme Charron

This means there's no markup in the OpenSearch output?


Yes, no markup for now.



Shouldn't there be?


The restriction on the description field is: "Can contain simple escaped
HTML markup, such as <b>, <i>, <a>, and <img> elements."
So, yes, why not. We can add <b> around highlights.
What do you and others think?



Perhaps this should be a method on Summary, to render it as html?


I had some hesitations about this while coding...
In fact, as suggested in the issue's comments, I would like to add a
generic method on Summary:
String toString(Encoder, Formatter), like in Lucene's Highlighter, and
provide some basic implementations of Encoder and Formatter.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Jérôme Charron

 String toString(Encoder, Formatter) like in the Lucene's Highlighter and
 provide some basic implementations of Encoder and Formatter.
That sounds fine, but in the meantime, let's not reproduce the
html-specific code in lots of places.  We need it in both search.jsp and
in OpenSearchServlet.java.  So we should have it in a common place.  A
method on Summary seems like a good place.  If we subsequently add a
more general API then we could re-implement the toHtml() method using
that API, but I think a generic toHtml() method will be useful for quite
a while yet.


Yes Doug, but in fact the idea is to add the toString(Formatter) method in
a common place (Summary),
and to add one specific Formatter implementation for OpenSearch and another
one for search.jsp.
The reason is that they should not use the same HTML code:
1. OpenSearch should only use <b> around highlights
2. search.jsp should use some more complicated HTML code (<span ...>)

In fact, I don't know if the Formatter solution is the right one, but the
toString() or toHtml() must be parameterized,
since the two pieces of code that use this method should have distinct
outputs.
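
For illustration, a hedged sketch of that parameterization (the names are
hypothetical, loosely modeled on Lucene's Highlighter Formatter):

  public interface Formatter {
    String highlight(String fragment);  // wrap one highlighted fragment
  }

  // OpenSearch: simple escaped HTML only
  public class OpenSearchFormatter implements Formatter {
    public String highlight(String fragment) {
      return "<b>" + fragment + "</b>";
    }
  }

  // search.jsp: richer markup for styling
  public class SearchJspFormatter implements Formatter {
    public String highlight(String fragment) {
      return "<span class=\"highlight\">" + fragment + "</span>";
    }
  }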

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: http chunked content

2006-05-08 Thread Jérôme Charron

As far as I know, a lot of http servers respond with chunked content -
at least all that return dynamically generated pages.
Should I file a bug?
Any thoughts?


In fact, the requests issued by the http plugin are HTTP 1.0, so the
servers should never return chunked content.
I think the readChunkedContent method was included in the code for future
use.
Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Feature idea - Indexing Text Lengths

2006-05-07 Thread Jérôme Charron

Sorry I can't give more than an idea - I'm not a java developer - but I
think the idea could prove useful.
The idea is to limit the length of sentences that get entered into the
index. So, after parsing a page, any words that don't make up what appears
to be a complete sentence get ignored.


Douglas,

Here is a previous discussion about this subject on the list:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03070.html
Take a look at that thread... this problem is not so easy.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron

I'm not so sure.  When crawling Apache we had trouble with this feature.
Some HTML files that had an XML header and that the server identified as
text/html, Nutch decided to treat as XML, not HTML.


Yes, the current version of the mime-type resolver is a crude one.
XML, HTML, RSS and all XML-based files are not always correctly identified.
(This problem is well known, and causes trouble, for instance, with RSS
feeds that return a text/xml content-type.)
 We had to turn off

the guessing of content types to index Apache correctly.


Instead of turning off the guessing of content types, you should only
remove the magic for xml in mime-types.xml.
In the new version (based on freedesktop) that has been sleeping for a
while on my disk, I think such problems are solved, since it introduces
much information not included in the current version:
a hierarchy between content-types (text/html is a subclass of text/xml), a
way to express complex magic clauses, and so on.
For instance, it can now correctly identify RSS documents: generally RSS
feeds are served with a generic text/xml content-type, and we cannot
identify them => they fall back to the generic parse-text parser.
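
For the curious, the freedesktop shared-mime-info format expresses exactly
this kind of information; a sketch from memory of what an RSS entry could
look like (values illustrative, syntax per the draft spec):

  <mime-type type="application/rss+xml">
    <sub-class-of type="text/xml"/>
    <magic priority="50">
      <match type="string" offset="0:256" value="&lt;rss"/>
    </magic>
    <glob pattern="*.rss"/>
  </mime-type>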



  I think we shouldn't aim to guess things any more than a browser does.
If browsers require standards compliance, then our lives will be
simpler.


Yes, but actually Nutch cannot act as a browser.
For instance with RSS: a browser knows that a URL is an RSS feed because
there is a <link rel="alternate" type="..."/> element with the correct
content-type (application/rss+xml) in the referring HTML page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).
Since most servers simply return the generic text/xml content-type, the
only way to know it is an rss-related document is to use magic content-type
guessing (you can notice that many browsers don't identify it as an rss
document either, but simply as a generic xml file).
One more thing: there is actually no officially registered content-type for
rss. So we can only use guessing from the document content to know it is an
rss document.


Jérôme


Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml

2006-04-27 Thread Jérôme Charron
  parse-oo plugin manifest is valid with plugin.dtd
 Oops, I didn't catch that... Thanks!

No problem Andrzej.
It is just a cosmetic change, since the plugin.xml files are not validated
at runtime (that is on my todo list),
and the contentType and pathSuffix parameters are more or less deprecated.

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 Are you mainly concerned with charset in Content-Type?

Not specifically.
But while looking at these content-type inconsistencies, I noticed that
there is some possible trouble with the charset in content-type.


 Currently, what happens when Content-Type exists in both HTTP layer and in
 META tag (if contents is HTML)?

We cannot use the one in meta tags: to extract it, we would first need to
know to use the html parser.
Only the HTTP header is used.
It is then checked/guessed using the mime-type repository (a mime-type
database that contains mime-types and their associated file extensions
and, optionally, some magic bytes).

How does Nutch guess Content-Type, and when does it need to do that?

See my response above


 Is there a situation where the guessed content-type differs from the
 content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured).


Here is an easy way to reproduce what I mean by content-type inconsistency:
1. Perform a crawl of the following URL:
http://jerome.charron.free.fr/nutch/fake.zip
(fake.zip is a fake zip file; in fact it is an html one)
2. While crawling, you can see that the content-type returned by the server
is application/zip
3. But you can see that Nutch correctly guesses the content-type as
text/html (it uses the HtmlParser)
4. At this step, all is ok.
5. Then start your tomcat and try the following search: zip
6. You can see the fake.zip file in the results. Click on details; if the
index-more plugin was activated, you can see that the stored content-type
is application/zip and not text/html

What I suggest is simply to index the content-type that nutch used to
select the parser, instead of the one returned by the server.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
 I'm not sure if that is the right thing.
 If the site administrator did a poor job and a wrong media type is
 advertised, it's the site's problem and Nutch shouldn't be fixing it, in
 my opinion.  Those sites would not work properly with browsers anyway,
 and Nutch doesn't need to work properly either, except that it should
 protect itself from crashing.  I tried to visit your fake.zip page with
 IE and Firefox, and both faithfully trusted the media type as advertised
 by the server, and asked me if I wanted to open it with WinZip or save
 it; there was no option to open it as HTML.
 Why should Nutch treat it as HTML?

Simply because it is an HTML file - with a strange name, of course, but it
is an HTML file.
My example is a kind of caricature, but a more realistic case could be: an
HTML file with a text/plain content-type, or with a text/xml one.
Finally, it is good news that Nutch seems to be more intelligent about
content-type guessing than Firefox or IE, no?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom

2006-04-26 Thread Jérôme Charron
 Is this just needed for references from javadoc?  If so, then this can
 be copied to build/docs, no?

Yes. Committed.

Jérôme


Nutch calendar

2006-04-14 Thread Jérôme Charron
Hi all,

Just for fun, I have created a public nutch calendar on Google Calendar.
You can add it to your Google calendars or access it via these URLs:
Feed URL is :
http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic
ICAL URL is :
http://www.google.com/calendar/ical/[EMAIL PROTECTED]/public/basic
Anybody is welcome to edit this calendar. Just contact me so that I add you
to the list of editors.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back to this issue:
The Content object holds two content-types:
1. The raw content-type from the protocol layer (the http header in the
case of http), in the Content's metadata
2. The guessed content-type, in a private content-type field.

When a ParseData object is created, it takes only the Content's metadata.
So the ParseData can only access the raw content-type, not the guessed one.

What I suggest is:
1. Add a content-type parameter to the ParseData constructors (so that
parsers can pass the guessed content-type to ParseData).
2. Have the Content object store the guessed content-type in its metadata
under a special attribute named, for instance, GUESSED_CONTENT_TYPE, so
that the ParseData can access it.

I think 1. is really the cleanest way to implement this, but there is a
lot of code impacted => all the parsers.
Solution 2. has no impact on the APIs, so the code changes are very small.
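
For solution 2, the change would be roughly this (the attribute name is
hypothetical, and the accessors are sketched from the metadata APIs of the
time):

  // in Content, after guessing: keep the raw header value, add the guess
  public static final String GUESSED_CONTENT_TYPE =
      "X-Nutch-Guessed-Content-Type";
  metadata.setProperty(GUESSED_CONTENT_TYPE, guessedType);

  // downstream, e.g. in index-more:
  String type = parseData.getMeta(GUESSED_CONTENT_TYPE);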

Suggestions? Comments?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: PMD integration

2006-04-11 Thread Jérôme Charron
  Piotr, please keep oro-2.0.8 in pmd-ext
 I do not agree here - we are going to make a new release next week and
 releasing with two versions of oro does not look nice. oro is quite a
 stable product and the changes are in fact minimal:
 http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES

OK for me.
But we cannot make a release without minimal tests.
(I will run some tests on removing oro from nutch's regex plugins for the
post-0.8 release.)

Jérôme


Content-Type inconsistency?

2006-04-10 Thread Jérôme Charron
It seems there is an inconsistency in content-type handling in Nutch:

1. The protocol-level content-type header is added to the content's
metadata.
2. The content-type is then checked/guessed while instantiating the
Content object, and stored in a private field
(at this step, the Content object can have 2 different content-types).
3. The Content's private content-type field is used to find the right
parser.
4. Once the Parse object is constructed, the Content is no longer used
(=> the guessed content-type is lost).
5. The index-more plugin then indexes the raw content-type, not the
guessed one.
6. As a consequence, the content-type displayed in more.jsp is the raw
one, and the one used to query on type is the raw one too.

Wouldn't it be better to always use the guessed content-type all along the
process?
(except in cache.jsp, where the raw one should be used)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Proposal] New Lucene sub-project

2006-04-10 Thread Jérôme Charron
 I found your idea very interesting. I will be interested to contribute to
 the Parse Plugins Framework. I have developed similar one using Lucene.
 The
 project name is Lius.

Hi Rida,

Yes, I know Lius.
It seems very interesting, and I think it would be very interesting too
if we could merge our efforts into a common lucene sub-project
(but for the moment, it seems that the tika project doesn't attract a lot
of interest...?)

If you are interested please let me know.

If nutch-dev folks are interested in creating such a project, you are welcome.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: PMD integration

2006-04-07 Thread Jérôme Charron
  that right now it is checking only main code (without plugins?).
 Yes, that's correct -- I forgot to mention that. PMD target is hooked up
 with tests and stops the build if something fails. I thought the core
 code should be this strict; for plugins we can have more relaxed rules

-1
Since plugins provide a lot of Nutch functionality (without any plugins,
Nutch provides no service), I think the plugin code should be as strict as
the core code.

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
  My feeling was simply that the closer we are to Nutch-1.0, the more we
 need some QA metrics (for us and for nutch users). No?
 I absolutely agree Jérôme, really. It's just that developers usually
 tend to hook up dozens of QA plugins and never look at what they output
 (that's the usual scenario with Maven-built projects that I observed).

Yes, that's right...;-)

What I think we need is a QA _person_ rather than just tools. But I'm
 always a bit skeptical, don't take it personally ;)

I absolutely agree Dawid. But I don't think Nutch has enough human
resources to have a QA person.
I will have a try at integrating a code coverage tool, and see if it gives
us some good indication of the unit-test effort needed.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: PMD integration

2006-04-07 Thread Jérôme Charron
 I will make it a totally separate target (so tests do not
 depend on it).

+1


 The goal is to allow other developers to play with pmd easily but at the
 same time I do not want the build to be affected.

+1


 I would also like to look at the possibility of generating
 cross-referenced HTML from the Nutch sources, as it looks like pmd can
 use it and violation reports would be much easier to read.

+1

Thanks Piotr (and Dawid too of course)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Jérôme Charron
 Do you guys have any additional insights / suggestions on whether
 NUTCH-240 and/or NUTCH-61 should be included in this release?

NUTCH-240: I really like the idea, but for now I agree that the API is
still ugly. I would like to help in the coming weeks...
So for me it should not be included in the 0.8 release...

Regards

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-06 Thread Jérôme Charron
 With code coverage... I don't know. It's
 up to you guys -- you spend much more time on Nutch code than I do and
 you know best what is needed and what isn't.

My feeling was simply that the closer we are to Nutch-1.0, the more we need
some QA metrics (for us and for nutch users). No?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
 PMD looks like a useful such tool:
 http://pmd.sourceforge.net/ant-task.html
 I would not be opposed to integrating PMD or something similar into
 Nutch's build.xml.  What do others think?  Any volunteers?

+1 (Very configurable, very good tool!)


Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
 I'm reluctant to move the extension interface away from the parameter
 and return value classes used by that interface.

I'm reluctant too... I asked, in case someone has a magic idea...


   Could we instead add a
 super-interface that all extension-point interfaces extend?  That way
 all of the extension points would be listed in javadoc as
 implementations of this interface.

+1 ... Committed.

One more question about javadoc (I hope the last one):
Do you think it makes sense to split the plugins gathered into the Misc
group into several plugins (such as index-more / query-more), so that each
sub-plugin can be dispatched into the proper group? Another solution could
be to use, in these plugins, a different package for each extension they
provide.

For instance, the language-identifier plugin could be split into the
following plugins:
* language-identifier
* parse-lang
* index-lang
* query-lang

Or simply refactor it into the following packages:
org.apache.nutch.analysis.lang
org.apache.nutch.parse.lang
org.apache.nutch.indexer.lang
org.apache.nutch.searcher.lang

Jérôme


Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
 No, I don't think so.  These are strongly related bundles of plugins.
 When you change one chances are good you'll change the others, so it
 makes sense to keep their code together rather than split it up.  Folks
 can still find all implementations of an interface in the javadoc, just
 not always grouped together in the table of contents.

So, we agree.


 We could instead of calling these misc call them compound plugins or
 something.  We can change the package.html for each to list the
 coordinated set of plugins they provide.  For example,
 language-identifier's could say something like, Includes parse, index
 and query plugins to identify, index and make searchable the identified
 language.

I plan to review all the package.html files ... I will include those changes.
Thanks!

Jérôme


Re: Refactoring some plugins

2006-03-29 Thread Jérôme Charron
 I don't think it's upside down.  Plugins should not share packages with
 core code, since that would permit them to use package-private APIs.
 Also, re-arranging the code to make the javadoc nice is right, since the
 javadoc is a primary means of describing the code.

Yes, but what I mean is that it is strange that it is a documentation
issue that raises this need for refactoring.

Moreover, I would like to suggest some other javadoc improvements (?):

1. Create a group for abstract plugins (like lib-http or lib-regex-filter),
named for instance Plugins API.
2. Create a group for extension points (as far as I remember, one of the
first problems when you want to extend nutch is to find where the hooks
are, i.e. what the extension points are). One more time, since the javadoc
groups are filtered by package, each extension-point interface must be
moved to a specific package.
The idea is then to move all the core extension points to a new package
(for instance org.apache.nutch.api).
3. Create several javadoc plugin groups (one for each major kind of plugin:
Indexing, Parsing, Protocol, Query, UrlFilter, and
Misc for those that cannot be categorized).
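
In build.xml terms, the grouping would look something like this (the
package names follow the proposal above and are purely illustrative):

  <javadoc destdir="${build.javadoc}" sourcepath="src/java"
           packagenames="org.apache.nutch.*">
    <group title="Extension Points" packages="org.apache.nutch.api*"/>
    <group title="Plugins API" packages="org.apache.nutch.protocol.http.api*"/>
  </javadoc>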

Thanks for your suggestions and comments.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Spelling suggestion for RSS Feed

2006-03-28 Thread Jérôme Charron
 I've implemented the spelling correction for the RSS Opensearch feed,
 hopefully in keeping with the opensearch guidelines.
 If this format is ok, I'll submit an optional patch alongside the
 current one at http://issues.apache.org/jira/browse/NUTCH-48 .

+1

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
 Besides that, we should maybe add a kind of timeout to the url filter in
 general,
 since it can happen that a user configures a regex for his nutch setup
 that runs into the same problem we just ran into.
 Something like the attached below.
 Would you agree? I can create a serious patch and test it if we are
 interested in adding this as a fallback to the sources.

+1 as a short-term solution.
In the long term, I think we should try to reproduce it and analyze what
really happens.
(I will commit some minimal unit tests in the next few days.)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
 If it were easy to implement all java regex features in
 dk.brics.automaton.RegExp, then they probably would have.  Alternately,
 if they'd implemented all java regex features, it probably wouldn't be
 so fast.  So I worry that attempts to translate are doomed.  Better to
 accept the differences: if you want the speed, you must use restricted
 regexes.

That's right. It is a deterministic API => more speed, but less
functionality.


 3. Add new plugins that use dk.brics.automaton.RegExp, using different
 default regex file names.  Then folks can, if they choose, configure
 things to use these faster regex libraries, but only if they're willing
 to write the simpler regexes that it supports.  If, over time, we find
 that the most useful regexes are easily converted, then we could switch
 the default to this.

+1
I will do it this way.
Thanks Doug.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Jérôme Charron
 I updated to the latest SVN revision (385691) today, and I am now seeing
 a Null Pointer exception in the AnalyzerFactory.java class.

Fixed (r385702). Thanks Chris.


 NOTE: not sure if returning null is the right thing to do here, but hey,
 at
 least it made my crawl finish! :-)

It is the right thing to do.

Cheers,

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
 It's not only faster, it also scales better for large and complex
 expressions, it is also possible to build automata from several
 expressions with AND/OR operators, which is the use case we have in
 regexp-urlfilter.

It seems awesome!
Does somebody plan to switch to this lib in nutch?
Is the BSD license compatible with the ASF one?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
 Thanks for volunteering, you're welcome ... ;-)

Good job Andrzej! ;-)
So, it's now on my todo list to check the perl5 compatibility issue and to
provide some benchmarks to the community...

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: quality of search text

2006-03-10 Thread Jérôme Charron
 I think algorithm #1 is what google uses.
 google ignores content that does not change from page to page, as well
 as content that isn't part of a <p> block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by google and displayed in
the summaries.

But if you can contribute an HtmlParseFilter with the ability to remove
menus and navigation, it will be a real improvement.
A first step, which I developed in a previous project many years ago, is
to remove pages whose textual content is only in links: it avoids indexing
frames or iframes that only contain some navigation text...

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


AnalyzerFactory

2006-03-10 Thread Jérôme Charron
It seems that the usage of AnalyzerFactory was removed while porting the
Indexer to map/reduce
(AnalyzerFactory is no longer called anywhere in trunk).
Is this intentional?
(If not, I have a patch that I can commit, so please confirm.)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Jérôme Charron
In fact, my first need was to be able to configure the boost for
RawFieldQueryFilter.
The idea is then to give the user better control over boost values by
simply:
* adding a setBoost(float) method to RawFieldQueryFilter
* (adding a setLowerCase(boolean) method to RawFieldQueryFilter)
* adding some configuration properties for the boost values of the actual
RawFieldQueryFilters: (CC|Type|RelTag|Site|Language)QueryFilter

Do you think it makes sense to commit such changes?
(Or is it just a very specific need that I have?)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
 In a distributed configuration one needs to rebuild the job jar each
 time anything changes, and hence must check all plugins, etc.  So I
 would appreciate it if this didn't take quite so long.

Makes sense!
Here is my proposal. For each plugin:
* Define a target that depends on core (used when building a single plugin)
* Define a target that does not depend on core (used when building the
whole code)
I will commit this as soon as possible; a sketch follows below.
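
For illustration, a hedged sketch of what the per-plugin build.xml split
could look like (target names illustrative, not the committed ones):

  <!-- used when building this plugin alone: recompile core first -->
  <target name="compile-standalone" depends="init">
    <ant target="compile-core" dir="../../.." inheritall="false"/>
    <antcall target="compile"/>
  </target>

  <!-- used by the global build, which has already compiled core -->
  <target name="compile" depends="init">
    <javac srcdir="${src.dir}" destdir="${build.classes}"
           classpathref="classpath"/>
  </target>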

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Jérôme Charron
 Adding DOAP for Nutch.  Contributed by Chris Mattmann.

 Added:
 lucene/nutch/trunk/site/doap.rdf
 Modified:
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
 lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbReader.java
 lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java

 lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
 lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java
 lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
 lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java

 lucene/nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java

 
 lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java

 lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

It seems that the NUTCH-143 patch has been committed too... is that intentional?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote:

 Jérôme Charron wrote:
  Here is my proposal. For each plugin:
  * Define a target containing core (will be used when building single
 plugin)
  * Define a target not containing core (will be used when building whole
  code)
  I commit this as soon as possible.

 That sounds perfect.  Thanks!

Committed.
Quick benchmarks:
* Before: around 70s
* After: around 50s

Better, but not quite perfect...  :-(

Jérôme


Re: Nutch Parsing PDFs, and general PDF extraction

2006-03-02 Thread Jérôme Charron
 This is something google does very well, and something nutch must match
 to compete.

Richard, it seems you are a real pdf guru, so any code contribution to nutch
is welcome.
;-)

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: PDF Parse Error

2006-03-02 Thread Jérôme Charron
 Yes, but please do not cross-post - many of us are subscribed to both
 groups, and we're getting multiple copies of your posts...

+1

I agree, this is inconsistent and should be changed. I think all places
 should use -1 as a magic value, because it's obviously invalid.

+1
Richard, could you please create a jira issue about this?
Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-02 Thread Jérôme Charron
 Calling compile-core for every plugin makes builds really slow.

I was surprised that nobody complained about this...   ;-)


   I
 think it's safe to assume that the core has already been compiled before
 plugins are compiled.  Don't you?

It just ensures that the most recently modified core version is
automatically compiled when compiling a single plugin.
From my point of view, the time for a whole build is not a problem.
If I just work on core, then I can use the fast compile-core target.
And if I just work on a plugin, I only compile the plugin.
Finally, I use the global compilation very rarely.
But perhaps that's not your case, and so it makes sense to reduce the time
of the whole build.

Jérôme


Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
 http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/packa
 ge-summary.html org.apache.nutch.parse.pdf (Nutch 0.7.1 API)
 but I dont see it in the source of 0.7.1 downloaded

 I see it on cvs here:
 http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s
 rc/java/net/nutch/parse/pdf/

First of all, the nutch source code is no longer hosted on sourceforge,
but at apache:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/

The class packages have also been changed to org.apache.nutch

 but my nutch doesn't seem to run the pdf parse class, as my log file
 shows it fetching pdfs but saying nutch is unable to parse content type
 application/pdf.
 Why is this?  Was it left out because of performance?

Have you activated the parse-pdf plugin in conf/nutch-default.xml or
conf/nutch-site.xml?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
 Putting the well-formed version of the plugin code you provided
 generated the following exception:

Is the nutch-extensionpoints plugin activated?


Re: duplicate libs

2006-02-16 Thread Jérôme Charron
 Sounds very good! I may have missed it - are you able to extract the
 dependencies from the plugin.xml without hacking ant?

Yes, by using the xmlproperty task: it defines a property for each path
found in the xml document
( http://ant.apache.org/manual/CoreTasks/xmlproperty.html ).
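
A tiny example of what this yields (plugin.xml layout per the plugin DTD;
the property path follows xmlproperty's attribute-collapsing rules):

  <xmlproperty file="plugin.xml" collapseAttributes="true"/>
  <!-- a <runtime><library name="foo.jar"/></runtime> element becomes
       ${plugin.runtime.library.name} = foo.jar
       (comma-separated when the element is repeated) -->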

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: duplicate libs

2006-02-15 Thread Jérôme Charron
 Yes, there is an easier way. Implement a custom task to which you'll
 pass a path to plugin.xml and a name for a path. The task (Java code)
 will create a named (id) path object which can subsequently be used in
 ant with <classpath refid="xxx"/>.

 This requires a custom ant task, but as you mentioned foreach is also a
 separate library, so I don't see a huge disadvantage.

 Carrot2's codebase contains similar functionality in the
 carrot2-ant-extensions module, although it should be trivial to
 implement it from scratch.

Thanks Dawid for all this information.
I really prefer your proposed way.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: duplicate libs

2006-02-15 Thread Jérôme Charron
 maybe you will find this interesting also:
 http://maven.apache.org/using/multiproject.html

Thanks Stefan.
Maven seems to be a really good software project management tool.
But for now, I don't plan to migrate to maven...
(I don't have enough knowledge about it, and so I don't have a good
overview of it).

Regards

Jérôme


Re: duplicate libs

2006-02-14 Thread Jérôme Charron
  There are a number of duplicated libs in the plugins, namely:

Isn't this already reported in http://issues.apache.org/jira/browse/NUTCH-196?
I have already provided a patch for the log4j lib.
If there is no objection, I will commit it and go ahead with:
* lib-commons-httpclient
* lib-nekohtml

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Empty Parse

2006-02-09 Thread Jérôme Charron
Hi all,

I just noticed an inconsistency when there is a parsing failure:

1. The Fetcher returns an empty ParseImpl instance (it contains no
metadata, in particular no SEGMENT_NAME_KEY and SIGNATURE_KEY).
2. The Indexer then tries to add the segment and digest fields to the
document from the metadata keys (SEGMENT_NAME_KEY and SIGNATURE_KEY).
Unfortunately these values are null, an NPE is thrown, and the process
fails.

My question is: what behaviour is expected in such a case?
1. The Fetcher must add the SEGMENT_NAME and SIGNATURE metadata to the
empty ParseImpl?
2. The Indexer must ignore documents without SEGMENT_NAME and SIGNATURE?
3. Both?

My feeling is 3, but I prefer that we discuss this point before
committing...
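
For option 2, the Indexer-side guard would be roughly this (a sketch only,
assuming the metadata keys named above and a LOG available in the class):

  String segment = parse.getData().getMeta(SEGMENT_NAME_KEY);
  String signature = parse.getData().getMeta(SIGNATURE_KEY);
  if (segment == null || signature == null) {
    // skip documents missing mandatory metadata instead of throwing an NPE
    LOG.warn("Skipping document with incomplete parse metadata");
    return;
  }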

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Jakarta-POI 3.0-alpha1

2006-02-09 Thread Jérôme Charron
Hi,

I have made some experiments with the 3.0-alpha1 version of Jakarta POI
(used by parse-msword and parse-mspowerpoint).
Since this version contains the hwpf package, it can parse msword
documents too (the current version in the lib-jakarta-poi plugin doesn't
contain this package).
The benefit is that we can remove the poi-2.1 jars bundled with
parse-msword and simply add a dependency on the lib-jakarta-poi plugin (as
for parse-mspowerpoint): just one version of the POI libs is bundled in
Nutch.
I performed some tests on a lot of zipped doc files (cool to test two
plugins at the same time) from the 3GPP site, and all is working fine.
I did not perform a lot of tests on powerpoints, but the unit tests are ok.

If there is no objection, I will commit the changes by the end of the week.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Empty Parse

2006-02-09 Thread Jérôme Charron
 Is this happening with the latest code?

Yes.
But looking in the svn repository... it is my fault... sorry (NUTCH-139).
I will fix that right now.

Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


javaswf.jar

2006-02-06 Thread Jérôme Charron
Hi,

It seems that the javaswf.jar lib was built using jdk 1.5:
class file has wrong version 49.0, should be 48.0

Did I miss something, or should Nutch still be compiled using jdk 1.4.x?
Please confirm, so that I can commit a new javaswf.jar built with jdk 1.4.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Cmd line for running plugins

2006-02-02 Thread Jérôme Charron
+1

On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote:

 +1

 Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki:

  Hi,
 
  I just found out that it's not possible to invoke main() methods of
  plugins through the bin/nutch script. Sometimes it's useful for
  testing and debugging - I can do it from within Eclipse, because I
  have all plugins on the classpath, but from the command-line it's
  not possible - in the code they are accessed through
  PluginRepository. So I added this:
 
  public static void main(String[] args) throws Exception {
    NutchConf conf = new NutchConf();
    PluginRepository repo = new PluginRepository(conf);
    // args[0] - plugin ID
    PluginDescriptor d = repo.getPluginDescriptor(args[0]);
    if (d == null) {
      System.err.println("Plugin '" + args[0] + "' not present or inactive.");
      return;
    }
    ClassLoader cl = d.getClassLoader();
    // args[1] - class name
    Class clazz = Class.forName(args[1], true, cl);
    Method m = clazz.getMethod("main", new Class[]{args.getClass()});
    String[] subargs = new String[args.length - 2];
    System.arraycopy(args, 2, subargs, 0, subargs.length);
    m.invoke(null, new Object[]{subargs});
  }
 
  It works rather nicely. If other people find it useful, I can add
  this to PluginRepository.
 
  --
  Best regards,
  Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 
 

 ---
 company:http://www.media-style.com
 forum:http://www.text-mining.org
 blog:http://www.find23.net






--
http://motrech.free.fr/
http://www.frutch.org/


Re: xml-parser plugin contribution

2006-01-24 Thread Jérôme Charron
 Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a
 new issue and attach the file.

Perhaps you can use this already-existing issue:
http://issues.apache.org/jira/browse/NUTCH-23

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Jérôme Charron
 Is it reasonable to guess language info from the target server's
 geographical info?

Yes, it could be another clue for guessing the language.
But the problem is then to find out how to combine all these clues.

For instance, the current solution is the easiest one, but certainly not
the most efficient one:
For HTML documents, the HTMLLanguageParser scans the document looking for
possible indications of the content language:
1. the html lang attribute
2. the dc.language meta tag
3. the http-equiv meta tag
The first one found is assumed to be the document's language.
Then, if no language is found, the statistical language identifier is
used.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
 Any plan to implement this? I mean, move the LanguageIdentifier class
 into the nutch core.

As I already suggested on this list, I would really like to move the
LanguageIdentifier class (and profiles) to an independent Lucene
sub-project (and the MimeType repository too).
I don't remember why, but there were some objections to this...

Here is a short status of what I have in mind for the next improvements to
the LanguageIdentifier / multilingual support:
* Enhance the LanguageIdentifier APIs by returning something like an
ordered LangDetail[] array when guessing the language (each LangDetail
would contain the language code and its score; a sketch follows below) - I
have a prototype version of this on my disk but haven't taken the time to
finalize it.
* I encountered some identification problems with some specific sites
(with blogger, for instance), and I plan to investigate this point.
* Another pending task: the analysis (and coding) of multilingual querying
support.
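
The prototype's API shape, from memory (names tentative, nothing committed
anywhere yet):

  public class LangDetail {
    private final String lang;   // ISO 639 code, e.g. "fr"
    private final float score;   // relative confidence, higher is better

    public LangDetail(String lang, float score) {
      this.lang = lang;
      this.score = score;
    }
    public String getLang() { return lang; }
    public float getScore() { return score; }
  }

  // identify() would then return LangDetail[] ordered best-first instead
  // of a single language code.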

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
 +1. Other local modifications which I use frequently:

 * exporting a list of supported languages,

 * exporting an NGramProfile of the analyzed text,

 * allow processing of chunks of input (i.e.
 LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
 very useful if the text to be analyzed is already present in memory, and
 the choice of sections (chunks) is made elsewhere, e.g. for documents
 with clearly outlined sections, or for multi-language documents.

Thanks for these interesting comments Andrzej - I'll add them to my todo
list.
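
For illustration, the chunk variant would be used roughly like this (the
three-argument identify() is only a suggestion at this point; the variable
names are placeholders):

// identify the language of one clearly-outlined section of an
// in-memory document, without copying it into a new String
void identifySection(LanguageIdentifier identifier, String wholeDocument,
                     int sectionStart, int sectionLength) {
  char[] buf = wholeDocument.toCharArray();
  String lang = identifier.identify(buf, sectionStart, sectionLength);
  System.out.println("guessed language: " + lang);
}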

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jérôme Charron
 I am wondering whether the Analyzer of nutch in svn trunk is chosen by the
 languageidentifier plugin or not (I know that in nutch 0.7.1-dev it was).

It's not really chosen by the languageidentifier, but chosen according to the
value of the lang attribute (for now, that's right, only the
languageidentifier adds this attribute).


 In org.apache.nutch.indexer.Indexer, line 104:
 writer.addDocument((Document)((ObjectWritable)value).get());
 It should be:
 NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
 writer.addDocument((Document)((ObjectWritable)value).get(), analyzer);
 right?

Yes, it should.
Thanks for noticing this.
Merge problem?
(I don't remember adding this in nutch-0.7 ...)


 Once more, query parsing should call the AnalyzerFactory?? The query input
 is multi-lingual also.

The query part is not yet implemented.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Jérôme Charron
 the following code would fail in case the meta tags are in upper case

 Node nameNode = attrs.getNamedItem("name");
 Node equivNode = attrs.getNamedItem("http-equiv");
 Node contentNode = attrs.getNamedItem("content");

This code works well, because the Nutch HTML Parser uses the Xerces
HTMLDocumentImpl implementation, which lowercases attribute names (while
element names are uppercased).
For consistency, and to decouple the Nutch HTML Parser from the Xerces
implementation a little, I suggest changing these lines to something like:
Node nameNode = null;
Node equivNode = null;
Node contentNode = null;
for (int i = 0; i < attrs.getLength(); i++) {
  Node attr = attrs.item(i);
  String attrName = attr.getNodeName().toLowerCase();
  if (attrName.equals("name")) {
    nameNode = attr;
  } else if (attrName.equals("http-equiv")) {
    equivNode = attr;
  } else if (attrName.equals("content")) {
    contentNode = attr;
  }
}


Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/


Re: ParserFactory test fail

2006-01-10 Thread Jérôme Charron
Hi Stefan,

No, in fact I have refactored the code of the protocol-http plugins, not the
html parser.
So, I don't think the log4j error comes from this code.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem too.
I don't understand what is happening.
In fact, the CommandRunner returns a -1 exit code, but there is nothing in the
error output and the expected string in the standard output (nutch rocks
nutch rocks nutch rocks).
Everything seems to be OK except the exit code.

Jérôme

On 1/9/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:

 It fails on my machine on the parse-ext tests. I am not sure what is causing
 it yet and I am afraid I do not have time to investigate it today -
 maybe in a few days. I did a small change to make it compile a few days
 ago, but all tests went ok before I committed it.
 Regards
 Piotr
 Stefan Groschupf wrote:
  Hi,
 
  is anyone able to run the test suite without any problems?
 
  Stefan
 
  ---
  company:http://www.media-style.com
  forum:http://www.text-mining.org
  blog:http://www.find23.net
 
 
 




--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Jérôme Charron
... in fact, not really... really unrelated!!!
I'll remove it immediately.
Thanks

On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] wrote:
  --- lucene/nutch/trunk/src/plugin/build.xml (original)
  +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan  8 16:13:42 2006
  @@ -6,13 +6,14 @@
    <!-- Build & deploy all the plugin jars. -->
    <!-- ===================================== -->
    <target name="deploy">
  -  <!--<ant dir="analysis-de" target="deploy"/>-->
  -  <!--<ant dir="analysis-fr" target="deploy"/>-->
  +  <ant dir="analysis-de" target="deploy"/>
  +  <ant dir="analysis-fr" target="deploy"/>

 Was this change intentional?  It looks unrelated.

 Otherwise, this looks great!

 Doug




--
http://motrech.free.fr/
http://www.frutch.org/


Re: problems http-client

2006-01-06 Thread Jérôme Charron
  A related issue is that these two plugins replicate a lot of code.  At
  some point we should try to fix that.  See:
 
 
 http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html

I have begun working on this. Is anybody else on it? Can I go on?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
 Excuse me in advance, I probably missed something, but what are the use
 cases for having many NutchConf instances with different values?
 Running many different tasks in parallel, each using different config,
 inside the same JVM.

Ok, I understand this Andrzej, but it is not really what I call a use case.
It is more a feature that you describe here.
In fact, what I mean is that I don't understand in which cases it will be
useful. And I don't understand how a particular
NutchConf will be selected for a particular task...

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Static initializers

2005-12-20 Thread Jérôme Charron
Andrzej,

How do you choose the NutchConf to use?
Here is a short discussion I had with Doug about a kind of dynamic NutchConf
inside the same JVM:

... By looking at the mailing list archives, it seems that having some
behavior depend on the document's URL is a recurrent problem (for instance
boosting documents matching a URL pattern - the NUTCH-16 issue - and many
other topics).
So, our idea is to provide a dynamic Nutch configuration (that overrides the
default one, like nutch-site does) based on documents
matching URL patterns. The idea is as follows:

1. The default configuration is, as usual, the nutch-default.xml file.

2. An XML file can map some URL regexps to many other configuration
files (that override the nutch-default):
<nutch:conf>
  <url regexp="http://www.mydomain1.com/*">
    <!-- A set of nutch properties that override the nutch-default
         for this domain -->
    <property>
      <name>property1</name>
      <value>value1</value>
    </property>
    ...
  </url>
  ...
</nutch:conf>

What do you think about this?
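
To make the idea more concrete, the lookup could be sketched like this (a
purely hypothetical class; nothing like it exists in Nutch yet):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

public class UrlConfResolver {

  // insertion order matters: the first matching pattern wins
  private final Map<Pattern, Properties> overrides =
      new LinkedHashMap<Pattern, Properties>();

  public void addOverride(String regexp, Properties props) {
    overrides.put(Pattern.compile(regexp), props);
  }

  /** Returns the overriding properties for this URL, or null. */
  public Properties resolve(String url) {
    for (Map.Entry<Pattern, Properties> e : overrides.entrySet()) {
      if (e.getKey().matcher(url).matches()) {
        return e.getValue();
      }
    }
    return null; // no override: the default configuration applies
  }
}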


 Looking deeper, this is more messy than I thought... Some changes would
 be required to the plugin instantiation mechanisms, e.g.:

 Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
 ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
 PluginRepository.getExtensionPoint(String) ->
 getExtensionPoint(String, NutchConf)

 etc, etc...

 The way this would work would be similar to the mechanism described
 above: if plugin instances are not created yet, they would be created
 once (based on the current NutchConf argument), and then cached in this
 NutchConf instance.

 And also the plugin implementations would have to extend
 NutchConfigured, taking NutchConf as the argument to their constructors
 - because now the Extension.getExtensionInstance would pass the current
 NutchConf instance to their constructors.

That's exactly what I had in mind while speaking about a dynamic NutchConf
with Doug.
For me it's a +1.
The only thing I don't really like is extending NutchConfigured, but it
is the safest way to implement it.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Latest version of Mapred

2005-12-19 Thread Jérôme Charron
 Thanks for the fast response,
 Do you know where I can find a compressed version?

Here are the nightly builds:
http://cvs.apache.org/dist/lucene/nutch/nightly/

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: vote results.

2005-12-15 Thread Jérôme Charron
 Just continue voting; I will continue with my tally sheet. :-)

Why not create a wiki page... so that you don't have to do this tedious
work?

Jérôme


Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
 What do people think about collecting a list of issues and making a voting
 iteration?

+1


Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi,

I would like to remove all the hard-coded content-type checks spread over
all the parse plugins.
In fact, the content-type/plugin-id mapping is now centralized in the
parse-plugin.xml file, and there is no
more need for the parsers to check the content-type themselves.
The basic idea was:
1. The developer has the responsibility to declare in the plugin.xml of his
parser the content-type(s) it handles.
2. The administrator then has the ability to use a parser for any
content-type he wants.
3. The ParserFactory warns the administrator if a parser is mapped to a
content-type that it was not initially designed to handle (according to the
plugin.xml file).
So there is no more need for hard-coded content-type checks.
It is the administrator's responsibility to take care of the
content-type/plugin-id mappings.
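
To make point 3 concrete, here is a rough sketch of the check (illustrative
names only, not the exact ParserFactory code):

import java.util.Set;

public class ParserMappingCheck {

  /**
   * The administrator's parse-plugin.xml mapping always wins; the factory
   * only warns when the parser's own plugin.xml does not declare the type.
   */
  static void checkMapping(String pluginId, Set<String> declaredTypes,
                           String mappedType) {
    if (!declaredTypes.contains(mappedType)) {
      System.err.println("WARN: parser " + pluginId
          + " is mapped to content-type " + mappedType
          + " that it was not designed to handle");
    }
  }
}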

For instance, in my use case, I have mapped the application/xhtml+xml
content-type to the parse-html parser.
But with the current hard-coded content-type check in parse-html, the
parse-html plugin cannot handle the application/xhtml+xml content.

If there is no objection, I will commit these changes in the next hours.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1
A simple solution that provides a standard way to access common meta data.
Great!

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to the 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88):

http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
http://svn.apache.org/viewcvs.cgi?rev=355809&view=rev
http://svn.apache.org/viewcvs.cgi?rev=354398&view=rev
http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321250&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321231&view=rev
http://svn.apache.org/viewcvs.cgi?rev=306808&view=rev
http://svn.apache.org/viewcvs.cgi?rev=293370&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292865&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292035&view=rev

Piotr, what about the Italian translation?
0.7.2 could be a good candidate for a commit, no?

 This has been fixed in the mapred branch, but that patch is not in
 0.7.1.  This alone might be a reason to make a 0.7.2 release.

http://svn.apache.org/viewcvs.cgi?view=rev&rev=348533

 I would be happy to see some more parser selection problems fixed, but it
 looks like Jerome is also working hard to get stuff fixed, so maybe we can
 wait until then.

I think we can wait for the enhancement proposed by Chris today: adding an
alias in the parse-plugin.xml file and using a content-type/extension-id
mapping instead of content-type/plugin-id.
For further improvements, the new mime-type repository based on the
freedesktop mime-types will be needed.
I cannot reasonably include this in 0.7.2, but I think it will be in trunk
by the end of the year.

What reasonable target date can we plan for a 0.7.2?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
 The total number of hits (approx) is 2,780,000,000. BTW, I find it
 curious that the last 3 or 6 digits always seem to be zeros ... there's
 some clever guesstimation involved here. The fact that Google Suggest is
 able to return results so quickly would support this suspicion.

For more information about fake Google counts, I suggest you take a look at
some tests performed by Jean Véronis, a French academic:
http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion:
For consistency, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and at the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include in the fetch process (all other
ones will be excluded) - see the sketch below.
No?
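
Here is a rough sketch of what I mean (extensionsFor() stands for a
hypothetical mime-type repository lookup, not an actual Nutch API):

import java.util.HashSet;
import java.util.Set;

public class FetchExtensionWhitelist {

  /** Builds the set of file extensions worth fetching. */
  static Set<String> allowedExtensions(Set<String> parseableContentTypes) {
    Set<String> extensions = new HashSet<String>();
    for (String contentType : parseableContentTypes) {
      // e.g. "text/html" -> {"html", "htm"}
      extensions.addAll(extensionsFor(contentType));
    }
    return extensions; // URLs with any other extension get filtered out
  }

  static Set<String> extensionsFor(String contentType) {
    return new HashSet<String>(); // stub for the repository lookup
  }
}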

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Jérôme Charron
Sounds really good (and it is requested by a lot of nutch users!).
+1

Jérôme

On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote:

 Matt Kangas wrote:
  #2 should be a pluggable/hookable parameter. high-scoring sounds  like
  a reasonable default basis for choosing recrawl intervals, but  I'm sure
  that nearly everyone will think of a way to improve upon  that for their
  particular system.
 
  e.g. high-scoring ain't gonna cut it for my needs. (0.5 wink ;)

 In NUTCH-61, Andrzej has a pluggable FetchSchedule.  That looks like a
 good idea.

 http://issues.apache.org/jira/browse/NUTCH-61

 Doug




--
http://motrech.free.fr/
http://www.frutch.org/


Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
 Right, but the URL filters run long before we know the mime type, in
 order to try to keep us from fetching lots of stuff we can't process.
 The mime type is not known until we've fetched it.

Yes, the fetcher can't rely on the document's mime-type.
The only thing we can use for filtering is the document's URL.
So, another alternative could be to exclude only the file extensions that are
registered in the mime-type repository
(some well-known file extensions) but for which no parser is activated, and
to accept all other ones.
So the .foo files would still be fetched... (see the sketch below)
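
A rough sketch of this variant (hypothetical helper sets, as before:
knownExts comes from the mime-type repository, parseableExts from the
activated parsers):

import java.util.Set;

public class FetchExtensionFilter {
  static boolean shouldFetch(String ext, Set<String> knownExts,
                             Set<String> parseableExts) {
    // unknown extensions (like ".foo") are fetched optimistically
    if (!knownExts.contains(ext)) return true;
    // known extensions pass only if an activated parser handles them
    return parseableExts.contains(ext);
  }
}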

Jérôme


Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
 Are we talking about parsing RDF, or about storing parsed HTML
 text in RDF and converting it via XSLT to pure text?
 I may have misunderstood something. I really like the idea of a general RDF
 parser. Back in the day I played around with jena.sf.net.
 Parsing yes; but replacing the Nutch sequence file and the concept of
 Writables with XML is, from my point of view, a bad idea.

Once more: please read the proposal and my responses one more time.
The proposal doesn't suggest replacing the way data are stored in Nutch.
It is just a proposal for a generic XML parser (as the title suggests).


 :-) I'm the last one to inhibit innovation, but I would love to see
 nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search engines
(look at Google, which no longer displays the number of indexed pages).
Parsing many content types, and the language technologies (language-specific
stemming, analysis, querying, summarization, ...), are some of the other new
challenges...
The low-level challenges are important, but they must not be a brake on
high-level processing.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Lucene or Nutch

2005-11-10 Thread Jérôme Charron
 I would be disappointed by this move - language identifier is an
 important component in Nutch. Now the mere fact that it's bundled with
 Nutch encourages its proper maintenance. If there is enough drive in
 terms of willingness and long-term commitment it would make sense to
 move it to a separate project on its own (or maybe as a part of Jakarta
 Commons), but moving it into a catch-all purely optional category like
 Lucene contrib would increase risks that it slides into oblivion...

Ok, Andrzej, I really understand your point.
But more and more people are contacting me directly in order to use the
language-identifier, not as a nutch plugin, but simply as a standalone
library. They get confused when I explain to them that they need the nutch
jar in order to use the language-identifier. That's why I would like to make
it a standalone
jar. A short-term solution could be to move the core classes (which have no
dependencies on
nutch) to a new lib-plugin (lib-lang for instance, adding a dependency on
this plugin to the
language-identifier), so that this code could be used as a standalone lib.

Are you ok, with such changes?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
 Yes, Lucene is the best fit for what you're after. Nutch is built on
 Lucene, and adds web crawling on top. You don't need a web crawler,
 so using Lucene directly is the best fit - of course you'll have to
 write code to integrate Lucene.

Erik,

I had been thinking about it for a while, but didn't take the time to do it.
This mail is a
good opportunity...
In fact, I think it could be a good idea to move the nutch language
identifier core code
to a standalone library or into the Lucene code.
Does it make sense? What do you think about it? What is the best solution
(standalone vs Lucene)?
Doug?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: standard version of log4j

2005-11-07 Thread Jérôme Charron
 hmmm.. so that means if we want to customize logging,
 it would potentially be for every plugin?

 Perhaps a common logger would at least make some degree
 of sense.

I really think it makes sense.
When I fixed the issue about plugin dependencies, I began to create a log4j
plugin
in order to remove all the log4j versions imported by many plugins (what you
suggest).

But it is not so simple.
In fact, parse-rss and parse-pdf use some log4j imports in their code just
to redirect
the log4j output to Java's native logger (they don't really customize
it).
The log4j imports are otherwise only needed by some other jars imported by
the plugins (not a direct dependency).
If the jars the plugins depend on use only common log4j features, it
seems there's no problem with removing
the log4j jars from each plugin and adding a dependency on a new lib-log4j
plugin. But the only ways to check for no regression are:
* Look in the source code of PDFBox and the other jars imported by plugins
that use log4j, and check that they can work with any other log4j-1.2.x
version.
* Create a lib-log4j plugin, remove all log4j jars, add a dependency on the
lib-log4j plugin in all the plugins that previously imported log4j.jar, and
then perform a runtime test of these plugins and cross fingers.

But sure, I really think it makes sense.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

2005-10-06 Thread Jérôme Charron
 There is one potential problem that I see -- Nutch plugins require
 explicit JAR references. If you want to switch between algorithms you'll
 need to either put all Carrot2 JARs in the descriptor, put them in
 CLASSPATH before Nutch starts or do some other trickery with class
 loading.

Only available in the trunk: you can also now define some inter-plugin
dependencies
using plugin identifiers instead of explicit jar references. These
dependencies are then
checked for availability and added to the classloader at runtime.
Take a look at the analyze-fr and analyze-de plugins, which depend on
lib-lucene-analyzers.
You can also notice that now, for instance, all plugins depend on the
nutch-extensionpoints plugin.

For instance, I recently noticed that many plugins import a log4j.jar.
It would be a good idea to define a lib-log4j plugin, and add a dependency
on this plugin for
each plugin that imports log4j.jar in its lib (of course, we must take
care of the log4j version used).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
 I think it would be neat to have the NutchAnalyzer also as a plugin. From my
 understanding, right now if I want to analyze in a different way, I need to
 hack the nutch source code; if we are going to have different plugins for
 different analyzers, that will help. One specific application may use the
 Porter
 analyzer, another uses Snowball for Italian ...; with the plugin
 approach
 these will coexist nicely.

 The same goes for providing summaries: for instance, if we enable
 clustering, the way the search result is summarized helps to get
 meaningful
 clusters.

 Let me know if you find this an attractive feature ;-) I can find some
 free time and do the coding.


Yes, it is definitely an attractive feature!

I have recently committed to the trunk support for multi-lingual analyzer
plugins.
There is an Analyzer extension point, so that you can develop your own
analysis plugins.
For now, the analyzer factory picks a plugin depending on the result of the
language identifier.
I have committed two analysis plugins, one for French and one for German.
They are just some wrappers
around the Lucene French and German analyzers.
By default, these plugins are not deployed, since:
1. they are at an early testing stage.
2. these analyzers make sense only if some query analyzers are provided too
(not yet done).
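
For illustration, such a wrapper is essentially just this (a simplified
sketch in the spirit of the analysis-fr plugin; the real plugin is also
declared through its plugin.xml):

import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.nutch.analysis.NutchAnalyzer;

public class FrenchNutchAnalyzer extends NutchAnalyzer {
  private final FrenchAnalyzer analyzer = new FrenchAnalyzer();

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // delegate all the analysis work to the Lucene French analyzer
    return analyzer.tokenStream(fieldName, reader);
  }
}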

You can take a look at the proposal I made earlier (not finished, since I
have been working on other issues for now):
http://wiki.apache.org/nutch/MultiLingualSupport

Cheers

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
 I read about the MultiLingualSupport page, but I didn't see it in the
 repository; I think it is cool.

The analyzer extension point is defined by the NutchAnalyzer abstract class:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java
The default analyzer is this one:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
The choice of the analyzer to use is done by the AnalyzerFactory:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
The german analyzer is located at:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-de/
and the french one at:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-fr/

 Yes, I actually hacked the src code to provide stemming and I changed the
 analyzer, added a new query-stemm plugin and changed the summarizer (as the
 terms were not highlighted after using the stemmer).

Sounds good!

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

