Re: End-Of-Life status for 0.7.x?

2008-01-18 Thread Jérôme Charron
+1 On Jan 18, 2008 5:22 AM, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Hi all, My opinion is that we should mark it EOL, and close all JIRA issues that are relevant only to 0.7.x, with the status Won't Fix. +1 -- Sami Siren -- Jérôme Charron Directeur

Re: log guards

2007-02-13 Thread Jérôme Charron
These guards were all introduced by a patch some time ago. I complained at the time and it was promised that this would be repaired, but it has not yet been. Yes, Sorry Doug that's my own fault I really don't have time to fix this :-( Best regards Jérôme

Re: log guards

2007-02-13 Thread Jérôme Charron
On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote: These guards were all introduced by a patch some time ago. I complained at the time and it was promised that this would be repaired, but it has not yet been. Yes, Sorry Doug that's my own fault I really don't have time to fix

Re: implement thai language indexing and search

2006-11-28 Thread Jérôme Charron
i used an existing ThaiAnalyzer which was in lucene package. ok - i renamed the lucene.analysis.th.* to nutch.analysis.th.* - compiled and placed all class files in a jar - analysis-th.jar (do i need to bundle the ngp file in the jar as well ?) 1. You don't have to refactor the lucene analyzer.

Re: implement thai language indexing and search

2006-11-16 Thread Jérôme Charron
ok. I was able to enable the language identifier plugin by adding the value in plugin.includes attribute in nutch-site.xml - but i'm not sure just by doing that I can have thai text recognized and tokenized properly. What else do I have to do ? Please help me. 1. You must create a thai NGP

Re: Content-type detection for Tika

2006-09-06 Thread Jérôme Charron
I'm thinking about implementing the (draft) shared MIME database spec [1] from freedesktop.org in Tika as a modern MIME magic implementation to help automatically detect and handle the types of resources where insufficient typing metadata is available. The specified typing information also

Re: Antwort: Re: parse-plugins.xml

2006-08-04 Thread Jérôme Charron
What you probably mean is something equivalent to Unix strings(1). I have a plugin that implements this, which I could contribute if there's interest. +1 Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Error with Hadoop-0.4.0

2006-07-07 Thread Jérôme Charron
I have the same problem on a distribute environment! :-( So I think can confirm this is a bug. Thanks for this feedback Stefan. We should fix that. What I suggest, is simply to remove the line 75 in createJob method from CrawlDb : setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME)); In

Error with Hadoop-0.4.0

2006-07-06 Thread Jérôme Charron
Hi, I encountered some problems with Nutch trunk version. In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK 1.5 (more precisely since HADOOP-129 and File replacement by Path). In my environment, the crawl command terminate with the following error: 2006-07-06

Re: Possible memory leak?

2006-06-28 Thread Jérôme Charron
It seems to be a side effect of NUTCH-169 (remove static NutchConf). Prior to this, the language identifier was a singleton. I think we should cache its instance in the conf as we do for many others objects in Nutch. Enrico, could you please create a JIRA issue. Thanks Jérôme --

Re: <noindex>do not index</noindex>

2006-06-22 Thread Jérôme Charron
as far as I can see nutch's html parser does only support the meta tag noindex (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">) but there is an unofficial html <noindex> tag. http://www.webmasterworld.com/forum10003/2703.htm Hello Stefan, Here is a previous discussion about this :

Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach

2006-06-22 Thread Jérôme Charron
I don't think guards should be added everywhere. That's right Doug. It was a rude first pass on logging. The next pass (finest) will be done with NUTCH-310. Rather, guards should only be added in performance critical code, and then only for Debug-level output. Info and Warn levels are
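A minimal sketch of the guideline quoted above: guard only Debug-level output whose message is costly to build, and leave Info-level output unguarded. It uses java.util.logging (Level.FINE standing in for Debug) so that it runs with the JDK alone; with Commons Logging the equivalent guard is LOG.isDebugEnabled(). The class name, the expensiveDump helper and the dumpCalls counter are made up for illustration and are not Nutch code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class LogGuardSketch {
    static final Logger LOG = Logger.getLogger(LogGuardSketch.class.getName());

    // Counts how often the costly message is actually built (illustration only).
    static int dumpCalls = 0;

    // Hypothetical stand-in for an expensive string construction.
    static String expensiveDump(int[] values) {
        dumpCalls++;
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.append(v).append(' ');
        }
        return sb.toString().trim();
    }

    static void process(int[] values) {
        // Guarded: the message is only built when debug-level output is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("values: " + expensiveDump(values));
        }
        // Unguarded: the message is cheap, so a guard would only add clutter.
        LOG.info("processed " + values.length + " values");
    }

    public static void main(String[] args) {
        process(new int[] {1, 2, 3});
        System.out.println("expensive dumps built: " + dumpCalls);
    }
}
```

The guard pays off only when building the message is costly; for a constant string the logger's own level check is enough.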

Re: [Nutch-cvs] svn commit: r414681 - /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolFactory.java

2006-06-16 Thread Jérôme Charron
I'm somewhat worried about the possible clash in the conf name-space - usually, when we store Object's in Configuration instance, we use their full class name, or at least a long and most probably unique string. In this case, we use just http, https, ftp, file and so on ... Would it make sense if

Re: [jira] Resolved: (NUTCH-303) logging improvements

2006-06-13 Thread Jérôme Charron
There seems to be two log4j.properties files in generated war, is this intentional? Not intentional. A side effect. In fact, the first one is the one that comes from conf dir (I will exclude it in war so that it will be clearer). The second one (that overrides the first one) is the good one that

Nutch logging questions

2006-06-09 Thread Jérôme Charron
Hi, I'm currently working on NUTCH-303 so that nutch uses commons logging facade API and log4j as the default implementation. All the code is actually switched to and uses Commons Logging API, and I have replaced some System.out and printStackTrace to make use of Commons Logging. To finalize

Re: Status of language plugin

2006-06-07 Thread Jérôme Charron
Is there an API doc or design doc that I can read to understand where you are? Is the language plugin architecture already in the main trunk? The only available document is http://wiki.apache.org/nutch/MultiLingualSupport and sometimes I maintain this page

Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Jérôme Charron
URL: http://svn.apache.org/viewvc?rev=411943&view=rev Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured for log4j by default. If log4j is now included in the core, we can remove the lib-log4j plugin. If no objection, I will do it. Jérôme --

Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Jérôme Charron
As far I understand hadoop use commons logging. Should we switch to use commons logging as well? Why not... (but using commons logging doesn't exclude to have a default implementation, such as log4j used by hadoop).

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-12 Thread Jérôme Charron
You're right -- changing anything with the input (snippets length, number of documents etc) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron
Add 3. Clustering would benefit from a plain text version. Yes Dawid, but it is already committed = the clustering now uses the plain text version returned by the toString() method. Dawid, I have a question about clustering. Actually, the clustering uses the summaries as input. I assumes it

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron
(but if the nutch-site.xml overrides the plugin.include property and doesn't include it it will not be activated, like any other plugin) yes, that's what I meant, I guess that's the default case for people hacking plugins. Oh, yes Sami, I understand what you mean... Sorry, I just forgot to

Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-11 Thread Jérôme Charron
Bob Carpenter of alias-i had this to say when I brought up this very idea: http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599 Thanks for you response Marvin. But finally my question is : shouldn't the nutch clustering uses some fixed size snippets instead of the configurable

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Jérôme Charron
This means there's no markup in the OpenSearch output? Yes, no markup for now. Shouldn't there be? The restriction on description field is : Can contain simple escaped HTML markup, such as <b>, <i>, <a>, and <img> elements. So, ya, why not. We can add <b> around highlights. What you and others

Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Jérôme Charron
String toString(Encoder, Formatter) like in the Lucene's Highlighter and provide some basic implementations of Encoder and Formatter. That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in

Re: http chunked content

2006-05-08 Thread Jérôme Charron
As far as I know a lot of http servers respond with chunked content, at least all that return dynamically generated pages. Should I file a bug? Any thoughts? In fact, the requests issued from http plugin are in HTTP 1.0, so the servers should never return some chunked content. I think that the

Re: Feature idea - Indexing Text Lengths

2006-05-07 Thread Jérôme Charron
Sorry i cant give more then an idea, I'm not a java developer, but I think the idea could prove useful. The idea is to limit the length of sentences that get entered into the index. So, after parsing a page, and words that don't make what appears to be a complete sentence get ignored. Douglas,

Re: Content-Type inconsistency?

2006-05-02 Thread Jérôme Charron
I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and the server identified as text/html Nutch decided to treat as XML, not HTML. Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML based

Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml

2006-04-27 Thread Jérôme Charron
parse-oo plugin manifest is valid with plugin.dtd Oops, I didn't catch that... Thanks! No problem Andrzej. It is just a cosmetic change since the plugin.xml are not validated at runtime (it is in my todo list), and the contentType and pathSuffix parameters are more or less deprecated. Jérôme

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
Are you mainly concerned with charset in Content-Type? Not specifically. But while looking at this content-type inconsistency, I noticed that there is some possible trouble with charset in content-type. Currently, what happens when Content-Type exists in both HTTP layer and in META tag

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
I'm not sure if that is the right thing. If the site administrator did a poor job and a wrong media type is advertised, it's the site's problem and Nutch shouldn't be fixing it, in my opinion. Those sites would not work properly with the browsers any way, and Nutch doesn't need to work

Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom

2006-04-26 Thread Jérôme Charron
Is this just needed for references from javadoc? If so, then this can be copied to build/docs, no? Yes. Committed. Jérôme

Nutch calendar

2006-04-14 Thread Jérôme Charron
Hi all, Just for fun, I have created a public nutch calendar on Google Calendar. You can add it to your Google calendars or access it via these URLs: Feed URL is : http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic ICAL URL is : http://www.google.com/calendar/ical/[EMAIL

Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back on this issue: The Content object holds two content-types: 1. The raw content-type from the protocol layer (http header in case of http) in the Content's metadata 2. The guessed content-type in a private field content-type. When a ParseData object is created, it takes

Re: PMD integration

2006-04-11 Thread Jérôme Charron
Piotr, please keep oro-2.0.8 in pmd-ext I do not agree here - we are going to make a new release next week and releasing with two versions of oro does not look nice. oro is quite stable product and changes are in fact minimal: http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES OK for

Content-Type inconsistency?

2006-04-10 Thread Jérôme Charron
It seems there is an inconsistency with content-type handling in Nutch: 1. The protocol level content-type header is added in content's metadata. 2. The content-type is then checked/guessed while instantiating the Content object and stored in a private field (at this step, the Content object can

Re: [Proposal] New Lucene sub-project

2006-04-10 Thread Jérôme Charron
I found your idea very interesting. I will be interested to contribute to the Parse Plugins Framework. I have developed similar one using Lucene. The project name is Lius. Hi Rida, Yes, I know Lius. It seems very interesting, and I think it would be very interesting too if we can merge our

Re: PMD integration

2006-04-07 Thread Jérôme Charron
that right now it is checking only main code (without plugins?). Yes, that's correct -- I forgot to mention that. PMD target is hooked up with tests and stops the build if something fails. I thought the core code should be this strict; for plugins we can have more relaxed rules -1 Since

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
My feeling was simply that the closer we get to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No? I absolutely agree Jérôme, really. It's just that developers usually tend to hook up dozens of QA plugins and never look at what they output (that's the usual

Re: PMD integration

2006-04-07 Thread Jérôme Charron
I will make it totally separate target (so test do not depend on it). +1 The goal is to allow other developers to play with pmd easily but at the same time I do not want the build to be affected. +1 I would like also to look at possibility to generate crossreferenced HTML code from

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Jérôme Charron
Do you guys have any additional insights / suggestions whether NUTCH-240 and/or NUTCH-61 should be included in this release? NUTCH-240 : I really like the idea, but for now, I agree with that is API is still ugly. I would like to help in the next weeks... So for me it should not be included in

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-06 Thread Jérôme Charron
With code coverage... I don't know. It's up to you guys -- you spend much more time on Nutch code than I do and you know best what is needed and what isn't. My feeling was simply that the closer we get to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No? Jérôme

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
PMD looks like a useful such tool: http://pmd.sourceforge.net/ant-task.html I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers? +1 (Very configurable, very good tool!)

Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
I'm reluctant to move the extension interface away from the parameter and return value classes used by that interface. I'm reluctant too... I asked, in case someone has a magic idea... Could we instead add a super-interface that all extension-point interfaces extend? That way all of the

Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
No, I don't think so. These are strongly related bundles of plugins. When you change one chances are good you'll change the others, so it makes sense to keep their code together rather than split it up. Folks can still find all implementations of an interface in the javadoc, just not always

Re: Refactoring some plugins

2006-03-29 Thread Jérôme Charron
I don't think it upside down. Plugins should not share packages with core code, since that would permit them to use package-private APIs. Also, re-arranging the code to make the javadoc nice is right, since the javadoc is a primary means of describing the code. Yes, but what I mean is that

Re: Spelling suggestion for RSS Feed

2006-03-28 Thread Jérôme Charron
I've implemented the spelling correction for the RSS Opensearch feed, hopefully in keeping with the opensearch guidelines. If this format is ok, I'll submit an optional patch alongside the current one at http://issues.apache.org/jira/browse/NUTCH-48 . +1 Jérôme -- http://motrech.free.fr/

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
Beside that, we may should add a kind of timeout to the url filter in general. Since it can happen that a user configure a regex for his nutch setup that run in the same problem as we had run right now. Something like below attached. Would you agree? I can create a serious patch and test it
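The attachment mentioned in this message is not part of the archive, so the following is only a sketch of one way such a timeout could look, not the patch that was proposed. It runs the match on a worker thread and treats a timeout as "no match". Because java.util.regex does not check for thread interruption on its own, the input is wrapped in a CharSequence that does; the names TimedRegexSketch, InterruptibleInput and findWithTimeout are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

class TimedRegexSketch {

    // Wrapper that aborts the match as soon as the worker thread is interrupted.
    static final class InterruptibleInput implements CharSequence {
        private final CharSequence inner;

        InterruptibleInput(CharSequence inner) {
            this.inner = inner;
        }

        public char charAt(int index) {
            if (Thread.currentThread().isInterrupted()) {
                throw new IllegalStateException("regex match interrupted");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new InterruptibleInput(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Returns whether the pattern is found in the url; a timeout counts as "not found".
    static boolean findWithTimeout(Pattern pattern, String url, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = () -> pattern.matcher(new InterruptibleInput(url)).find();
            Future<Boolean> result = executor.submit(task);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupts the worker so that a runaway match stops at its next charAt().
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Pattern images = Pattern.compile("\\.(gif|jpg|png)$");
        System.out.println(findWithTimeout(images, "http://example.com/logo.gif", 1000));
    }
}
```

Creating an executor per call keeps the sketch short; a real filter would more likely reuse one worker and decide explicitly whether a timed-out URL is accepted or rejected.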

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
If it were easy to implement all java regex features in dk.brics.automaton.RegExp, then they probably would have. Alternately, if they'd implemented all java regex features, it probably wouldn't be so fast. So I worry that attempts to translate are doomed. Better to accept the differences:

Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Jérôme Charron
I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. Fixed (r385702). Thanks Chris. NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-) It is the

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! Does somebody plan to switch to this lib in nutch? Does

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
Thanks for volunteering, you're welcome ... ;-) Good job Andrzej !;-) So, That's now in my todo list to check the perl5 compatibility issue and to provide some benchs to the community... Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: quality of search text

2006-03-10 Thread Jérôme Charron
I think algorithm # 1 is what google uses. google ignores content that does not change from page to page, as well as content that isn't part of a block of text. Are you sure? Take a look at this search results:

AnalyzerFactory

2006-03-10 Thread Jérôme Charron
It seems that the usage of AnalyzerFactory was removed while porting Indexer to map/reduce. (AnalyzerFactory is no more called in trunk code) Is it intentional? (if no, I have a patch that I can commit, so thanks to confirm) Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Jérôme Charron
In fact, my first need was to be able to configure the boost for RawFieldQueryFilter. The idea is then to give to the user a better control of boost values by simply : * add a setBoost(float) method to RawFieldQueryFilter. * (add a setLowerCase(boolean) method to RawFieldQueryFilter) * Add some

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
In a distributed configuration one needs to rebuild the job jar each time anything changes, and hence must check all plugins, etc. So I would appreciate it if this didn't take quite so long. Make sense! Here is my proposal. For each plugin: * Define a target containing core (will be used when

Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Jérôme Charron
Adding DOAP for Nutch. Contributed by Chris Mattmann. Added: lucene/nutch/trunk/site/doap.rdf Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote: Jérôme Charron wrote: Here is my proposal. For each plugin: * Define a target containing core (will be used when building single plugin) * Define a target not containing core (will be used when building whole code) I commit this as soon

Re: Nutch Parsing PDFs, and general PDF extraction

2006-03-02 Thread Jérôme Charron
This is something google does very well, and something nutch must match to compete. Richard, it seems you are a real pdf guru, so any code contribution to nutch is welcome. ;-) Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: PDF Parse Error

2006-03-02 Thread Jérôme Charron
Yes, but please do not cross-post - many of us are subscribed to both groups, and we're getting multiple copies of your posts... +1 I agree, this is inconsistent and should be changed. I think all places should use -1 as a magic value, because it's obviously invalid. +1 Richard, could you

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-02 Thread Jérôme Charron
Calling compile-core for every plugin makes builds really slow. I was surprised that nobody complain about this... ;-) I think it's safe to assume that the core has already been compiled before plugins are compiled. Don't you? It just ensure that the last modified core version is

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html org.apache.nutch.parse.pdf (Nutch 0.7.1 API) but I don't see it in the source of 0.7.1 downloaded I see it on cvs here: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
Putting the wellformed version of the plugin code you provided generated the following exception: Is the nutch-extensionpoints plugin activated?

Re: duplicate libs

2006-02-16 Thread Jérôme Charron
Sounds very good! I may missed - that are you able to extract the dependencies from the plugin.xml without hacking ant? Yes, by using the xmlproperty task: it defines a property for each path found in the xml document ( http://ant.apache.org/manual/CoreTasks/xmlproperty.html ) Jérôme --

Re: duplicate libs

2006-02-15 Thread Jérôme Charron
Yes, there is an easier way. Implement a custom task to which you'll pass a path to plugin.xml and a name for a path. The task (Java code) will create a named (id) path object which can be subsequently used in ant with <classpath refid="xxx" />. This requires a custom ant task, but as you

Re: duplicate libs

2006-02-15 Thread Jérôme Charron
may you will find that interesting also: http://maven.apache.org/using/multiproject.html Thanks Stefan. Maven seems to be a really good project software management tool. But for now, I don't plan to migrate to maven... (I don't have enough knowledge about it and so I don't have a good overview

Re: duplicate libs

2006-02-14 Thread Jérôme Charron
There are a number of duplicated libs in the plugins, namely: Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196? I have still provided a patch for a log4j lib. If there is no objection, I will commit it and go ahead for * lib-commons-httpclient * lib-nekohtml Jérôme

Empty Parse

2006-02-09 Thread Jérôme Charron
Hi all, I just notice an inconsistency when there is a parsing failure : 1. The Fetcher return an empty ParseImpl instance (it contains no metadata, especially SEGMENT_NAME_KEY and SIGNATURE_KEY) 2. Then, the Indexer tries to add the fields segment and digest from the metadata keys

Jakarta-POI 3.0-alpha1

2006-02-09 Thread Jérôme Charron
Hi, I have made some experiments with the 3.0-alpha1 version of Jakarta POI (used by parse-msword and parse-mspowerpoint). Since this version contains the hwpf package it enables to parse msword documents too (the actual version in lib-jakarta-poi plugin doesn't contain this package). The benefit

Re: Empty Parse

2006-02-09 Thread Jérôme Charron
Is this happening with the latest code? Yes. But by looking in the svn repository . it is my fault ... sorry (NUTCH-139) I fix that right now. Thanks Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

javaswf.jar

2006-02-06 Thread Jérôme Charron
Hi, It seems that the javaswf.jar lib was built using jdk 1.5: class file has wrong version 49.0, should be 48.0 Did I miss something, or should Nutch still be compiled using jdk 1.4.x ? Please confirm, so that I can commit a new javaswf.jar built with jdk 1.4 Regards Jérôme --
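The numbers in that compiler error are class-file format versions: major version 48 is the format written by JDK 1.4 and 49 the one written by JDK 5. A small sketch of how the major version can be read from the first eight bytes of a class file; the class and method names are made up for illustration, and the header in main is hand-built rather than taken from javaswf.jar.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class ClassVersionSketch {

    // Reads the major version from a class file header:
    // 4 bytes magic (0xCAFEBABE), 2 bytes minor version, 2 bytes major version.
    static int majorVersion(byte[] classBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classBytes))) {
            if (in.readInt() != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file");
            }
            in.readUnsignedShort(); // minor version, not needed here
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new IllegalArgumentException("truncated class file header", e);
        }
    }

    public static void main(String[] args) {
        // Hand-built header of a class compiled for Java 5 (major version 49).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49};
        System.out.println("major version: " + majorVersion(header));
    }
}
```

To check a jar, the same method can be applied to the bytes of any .class entry in it.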

Re: Cmd line for running plugins

2006-02-02 Thread Jérôme Charron
+1 On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote: +1 Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki: Hi, I just found out that it's not possible to invoke main() methods of plugins through the bin/nutch script. Sometimes it's useful for testing and debugging - I can do it

Re: xml-parser plugin contribution

2006-01-24 Thread Jérôme Charron
Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a new issue and attach the file. Perhaps you can use this already existing issue http://issues.apache.org/jira/browse/NUTCH-23 Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Jérôme Charron
Is it reasonable to guess language info. from target servers geographical info.? Yes, it could be another clue to guess language. But the problem is then to find how to use all these indices. For instance, the actual solution is the easiest one, but certainly not the more efficient one: For

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
Any plan to implement this ? I mean move LanguageIdentifier class into nutch core. As I already suggested it on this list, I really would like to move the LanguageIdentifier class (and profiles) to an independent Lucene sub-project (and the MimeType repository too). I don't remember why but

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
+1. Other local modifications which I use frequently: * exporting a list of supported languages, * exporting an NGramProfile of the analyzed text, * allow processing of chunks of input (i.e. LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is very useful if the text to

Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jérôme Charron
I am wondering Analyzer of nutch in svn trunk is chosen by languageidentifier plugin or not? (I knew in nutch 0.7.1-dev it did). It's not really chosen by the languageidentifier, but chosen regarding the value of the lang attribute (for now, that's right, only the languageidentifier adds this

Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Jérôme Charron
the following code would fail in case the meta tags are in upper case Node nameNode = attrs.getNamedItem("name"); Node equivNode = attrs.getNamedItem("http-equiv"); Node contentNode = attrs.getNamedItem("content"); This code works well, because Nutch HTML Parser uses Xerces
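The concern in the question can be reproduced with the plain XML DOM that ships with the JDK, where getNamedItem is case-sensitive. The sketch below shows only that generic behaviour; it does not use the Nutch HTML parser, whose handling the (truncated) reply above addresses. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeCaseSketch {

    // Parses a single-element XML string and looks up one attribute of the root element.
    // Returns null when no attribute with exactly that name exists.
    static String attribute(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
            Node node = attrs.getNamedItem(attrName);
            return node == null ? null : node.getNodeValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String meta = "<meta NAME=\"ROBOTS\" CONTENT=\"NOINDEX,NOFOLLOW\"/>";
        System.out.println("lookup by lower-case name: " + attribute(meta, "name"));
        System.out.println("lookup by upper-case name: " + attribute(meta, "NAME"));
    }
}
```

Whether upper-case meta tags are a problem in practice therefore depends on whether the parser in front of this code normalizes attribute names.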

Re: ParserFactory test fail

2006-01-10 Thread Jérôme Charron
Hi Stefan, No in fact, I have refactored the code of protocol-http plugins, not html parser. So, I don't think the log4 error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem too. I don't understand what happens. In fact, the CommandRunner returns a -1 exit code, but nothing in the error output and the good string in the standard output (nutch rocks nutch rocks nutch rocks). All seems to be ok but the exit code. Jérôme On 1/9/06, Piotr

Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Jérôme Charron
... in fact, not really... really unrelated !!! I remove it immediately. Thanks On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: --- lucene/nutch/trunk/src/plugin/build.xml (original) +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 @@ -6,13

Re: problems http-client

2006-01-06 Thread Jérôme Charron
A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on? Jérôme -- http://motrech.free.fr/

Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values? Running many different tasks in parallel, each using different config, inside the same JVM. Ok, I understand this Andrzej, but it is not really what I call

Re: Static initializers

2005-12-20 Thread Jérôme Charron
Andrzej, How do you choose the NutchConf to use ? Here is a short discussion I had with Doug about a kind of dynamic NutchConf inside the same JVM: ... By looking at the mailing lists archives it seems that having some behavior depending on the documents URL is a recurrent problem (for instance

Re: Latest version of Mapred

2005-12-19 Thread Jérôme Charron
Thanks for the fast response, Do you know where I can find a compressed version? Here are the nightly builds: http://cvs.apache.org/dist/lucene/nutch/nightly/ Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: vote results.

2005-12-15 Thread Jérôme Charron
Just continue voting I will continue with my tally sheet. :-) Why not creating a wiki page... so that you don't have to do this bad work. Jérôme

Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
What people think if we collect a list of issues and make a voting iteration? +1

Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi, I would like to remove all the hard-coded content-type checks spread over all the parse plugins. In fact, the content-type/plugin-id mapping is now centralized in the parse-plugin.xml file, and there's no more need for the parser to check the content-type. The basic idea was: 1. The

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1 A simple solution that provides a standard way to access common meta data. Great! -- http://motrech.free.fr/ http://www.frutch.org/

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to the 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88) http://issues.apache.org/jira/browse/NUTCH-112 http://issues.apache.org/jira/browse/NUTCH-135 http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
The total number of hits (approx) is 2,780,000,000. BTW, I find it curious that the last 3 or 6 digits always seem to be zeros ... there's some clever guesstimation involved here. The fact that Google Suggest is able to return results so quickly would support this suspicion. For more

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion: For consistency and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file
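
The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Nutch code: the two tables, the plugin ids in them, and the function names are hypothetical stand-ins for the mime-type mapping in parse-plugins.xml and for an extension-to-mime-type lookup. A URL is rejected only when its extension maps to a mime-type that no activated plugin can parse.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical stand-in for the mime-type -> plugin-id mapping
# that parse-plugins.xml centralizes (entries are illustrative only).
PARSE_PLUGINS = {
    "text/html": ["parse-html"],
    "application/pdf": ["parse-pdf"],
    "application/msword": ["parse-msword"],
}

# Hypothetical extension -> mime-type table (illustrative only).
EXTENSION_TO_MIME = {
    ".html": "text/html",
    ".htm": "text/html",
    ".pdf": "application/pdf",
    ".doc": "application/msword",
}


def parseable_mime_types(parse_plugins, activated):
    """Return the mime-types that at least one activated plugin can parse."""
    return {
        mime
        for mime, plugin_ids in parse_plugins.items()
        if any(plugin_id in activated for plugin_id in plugin_ids)
    }


def accept_url(url, parse_plugins, activated):
    """Reject a URL only if its extension maps to an unparseable mime-type.

    URLs with no extension, or with an extension missing from the table,
    are accepted, because nothing can be decided before fetching them.
    """
    extension = splitext(urlparse(url).path)[1].lower()
    mime = EXTENSION_TO_MIME.get(extension)
    if mime is None:
        return True
    return mime in parseable_mime_types(parse_plugins, activated)
```

With only the hypothetical `parse-html` and `parse-pdf` plugins activated, a `.doc` URL would be filtered out while `.pdf`, `.html`, and extension-less URLs would pass. Accepting unknown extensions is a deliberate choice in this sketch, since an extension is only a hint about the real content-type.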

Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Jérôme Charron
Sounds really good (and it is requested by a lot of nutch users!). +1 Jérôme On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote: Matt Kangas wrote: #2 should be a pluggable/hookable parameter. high-scoring sounds like a reasonable default basis for choosing recrawl intervals, but I'm sure

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
Are we talking about parsing RDF, or about storing parsed HTML text in RDF and converting it via XSLT to plain text? I may be misunderstanding something. I really like the idea of a general RDF parser. Back in the day I played around with jena.sf.net Parsing yes, replace nutch sequence file and the

Re: Lucene or Nutch

2005-11-10 Thread Jérôme Charron
I would be disappointed by this move - language identifier is an important component in Nutch. Now the mere fact that it's bundled with Nutch encourages its proper maintenance. If there is enough drive in terms of willingness and long-term commitment it would make sense to move it to a

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

Re: standard version of log4j

2005-11-07 Thread Jérôme Charron
hmmm.. so that means if we want to customize logging it would be for every plugin potentially? Perhaps a common logger would at least make some degree of sense. I really think it makes sense. When I fixed the issue about plugin dependencies, I began to create a log4j plugin in order to remove

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

2005-10-06 Thread Jérôme Charron
There is one potential problem that I see -- Nutch plugins require explicit JAR references. If you want to switch between algorithms you'll need to either put all Carrot2 JARs in the descriptor, put them in CLASSPATH before Nutch starts or do some other trickery with class loading. Only

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
I think it would be neat to have the NutchAnalyzer also as a plugin. From my understanding, right now if I want to analyze in a different way, I need to hack the Nutch source code; having different plugins for different analyzers would help. Some specific application may use

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
I read about the MultiLingualSupport, but I didn't see it in the repository. I think it is cool. The analyzer extension point is defined by the Analyzer abstract class: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java The default analyzer
