Re: VOTE: clustering plugin update for Rel 0.7

2005-08-15 Thread Jérôme Charron
-1 Maybe it would be a better idea to go for 0.7 branch and schedule a new 0.7.1 release in short time? But +1 to include it in a 0.7.1 release !! Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Failing JUnit test

2005-08-19 Thread Jérôme Charron
expected:<el> but was:<sv> junit.framework.ComparisonFailure: expected:<el> but was:<sv> I suspect it is a result of the latest updates to the LanguageIdentifier plugin or its tests. I am not deep into it, so I will not try to debug it myself at the moment - I just wanted you to know about the issue. You

Re: Failing JUnit test

2005-08-19 Thread Jérôme Charron
I am using JDK 1.5 on Windows - I can test it on 1.4,1.5 on linux tomorrow - maybe this is the problem. OK. Thanks Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Failing JUnit test

2005-08-20 Thread Jérôme Charron
It works on my Linux box - with both JDK 1.4 and 1.5. ok. so it seems to be consistent with my conf. I will try to track it down. I assume it is an encoding problem of the Ngram profile files, but I have no time this evening. Regards Jérôme

Re: svn commit: r240254 - in /lucene/nutch/tags/Release-0.7/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang: HTMLLanguageParser.java LanguageIdentifier.java LanguageIndexingFilte

2005-08-26 Thread Jérôme Charron
It looks like you have committed your changes to the tags directory. You should do it in branches. I think there is no way in SVN to force immutability of tags :(. Oops, sorry. I am committing my changes in the branches directory right now. Thanks Piotr. Regards Jerome -- http://motrech.free.fr/

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-28 Thread Jérôme Charron
I see several instances of 'analySer' in comments/javadoc and some variables. That should probably be changed to american version - analyzer, for consistency's sake. Yes, that's right. Thanks. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
I did a little digging and it appears that lang ends up being null (couldn't quite track down where lang should have been set). Not sure if it is a proper fix, but changing doc.getField(lang).stringValue() to doc.get(lang), makes my little crawl complete. lang is null cause you don't have

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
I agree it is important to have the NGramProfile.getSimilarity() method. However, I think it is also important that it is consistent with the scoring that LanguageIdentifier uses, even if LanguageIdentifier optimises the implementation. Looking at the code I see that the two scoring

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
Tom, I have created the NUTCH-86 issue to report the needed changes in the LanguageIdentifier we discussed in this thread. The issue is available at http://issues.apache.org/jira/browse/NUTCH-86 Regards Jérôme

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-31 Thread Jérôme Charron
I see several instances of 'analySer' in comments/javadoc and some variables. That should probably be changed to american version - analyzer, for consistency's sake. Corrected/Committed (http://svn.apache.org/viewcvs.cgi?rev=265020&view=rev) Regards Jérôme -- http://motrech.free.fr/

Re: [Nutch Wiki] Update of Committer's Rules by AndrzejBialecki

2005-08-31 Thread Jérôme Charron
Glancing at other Apache projects in subversion, I see that httpd uses branch names like 2.2.x and tag names like 2.2.4. That's a little cryptic. I propose that we use branch names like branch-2.4 and tag names like release-2.4.1. What do folks think? +1 Jérôme -- http://motrech.free.fr/

Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
I am a bit lost but just a quick check - shouldn't it also be committed in Release-0.7 branch? No, the analyzer extension-point is committed only in trunk. It's a new feature, so I follow Committer's Rules ( http://wiki.apache.org/nutch/Committer's_Rules) ;-) Regards Jérôme --

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-31 Thread Jérôme Charron
Michael, the solution is perhaps to use Jakarta Commons DateUtils.parseDate method: http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[]) It will give something like: Date parsedDate =
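A minimal sketch of the multi-pattern idea suggested in this message, since the snippet is cut off before its code. It uses only the JDK's SimpleDateFormat as a stand-in for Commons Lang's DateUtils.parseDate, and the class name, method name, and pattern list are assumptions made for illustration, not the patch attached to NUTCH-65:

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class MultiPatternDateParser {

    // Hypothetical pattern list; the patterns index-more actually needs
    // are not shown in the thread.
    static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",  // RFC 1123 style
        "EEEE, dd-MMM-yy HH:mm:ss zzz",   // RFC 850 style
        "yyyy-MM-dd HH:mm:ss",
        "yyyy-MM-dd"
    };

    // Tries each pattern in order and returns the first full match.
    // Throws ParseException when no pattern consumes the whole input.
    static Date parse(String text, String[] patterns) throws ParseException {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(text, pos);
            // Accept only if the pattern matched the entire string.
            if (date != null && pos.getIndex() == text.length()) {
                return date;
            }
        }
        throw new ParseException("Unable to parse date: " + text, 0);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parse("2005-08-31", PATTERNS).getTime());
    }
}
```

Requiring the whole string to be consumed avoids silently accepting a value that only starts like a date; whether the real fix needs that strictness is a judgment call.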

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-09-01 Thread Jérôme Charron
it works great (see the new function below). But we'll have to add commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries. Are there any objections? What is the procedure to add it? There's already commons-logging, in nutch libs, so I think there's no problem to add

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-09-01 Thread Jérôme Charron
There's already commons-logging, in nutch libs, so I think there's no problem to add commons-lang. Moreover it is under Apache License, so there's no problem. I will add it while committing your patch. No objections for adding commons-lang to the nutch lib. As it is a generic lib, I plan

Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
to get regex-normalize.xml to work i must put: in nutch-site.xml In nutch-default.xml there is set: Is this a bug or a feature? =) nutch-site.xml overrides properties defined in nutch-default. So: * If you remove urlnormalizer.class property from nutch-default it must still use the one

Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
i think i expressed it wrong. The question was whether it's a feature or a bug that regex-normalize.xml is used only after these changes. The regex-normalize.xml is used only after you specify that you want to use the RegexUrlNormalizer implementation. So it is used only if you specify
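The override behaviour described in this thread (a value in nutch-site.xml wins over the same property in nutch-default.xml) can be sketched with plain JDK classes. This is only an illustration: java.util.Properties stands in for Nutch's own configuration class, LayeredConf is a hypothetical name, and "BasicUrlNormalizer" is an assumed default value, not taken from the messages:

```java
import java.util.Properties;

public class LayeredConf {

    // Returns a view where entries from 'site' override entries from
    // 'defaults'; keys missing from 'site' fall back to 'defaults'.
    static Properties layered(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults);
        merged.putAll(site);
        return merged;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        // Assumed default value, for illustration only.
        defaults.setProperty("urlnormalizer.class", "BasicUrlNormalizer");

        Properties site = new Properties();
        site.setProperty("urlnormalizer.class", "RegexUrlNormalizer");

        System.out.println(
            layered(defaults, site).getProperty("urlnormalizer.class"));
    }
}
```

Under this model regex-normalize.xml would only matter once the site layer selects the regex implementation, which matches the explanation given in the reply.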

MS related plugins refactoring

2005-09-05 Thread Jérôme Charron
Hi, I have just committed some modifications that make it possible to declare dependencies between plugins. I would like to apply this mechanism to the parse-ms* related plugins, which both use Jakarta POI code. The idea is: instead of duplicating the Jakarta POI related jar in each lib directory of parse-ms*

Re: MS related plugins refactoring

2005-09-05 Thread Jérôme Charron
I have just committed some modifications that enable to have some dependencies between plugins. This mechanism already works, since a plugin use jar urls from all dependent plugins in its own class-loader. Ok. So, after a long private mail exchange with Stefan (thanks for your time and

Re: Nutch debugging log in Tomcat run time

2005-09-06 Thread Jérôme Charron
The change doesn't reflect in the screen after I re-compile the Nutch code and re-launch the tomcat. Do you re-deploy the web app? -- http://motrech.free.fr/ http://www.frutch.org/

Plugins dependencies enhancement proposal

2005-09-06 Thread Jérôme Charron
Since plugins can now specify dependencies on each other, this raises an administration problem. For a Nutch administrator, it is not user-friendly to specify which plugins to activate/deactivate. With plugin inter-dependencies, the administrator needs to know that a plugin depends on another one

Re: MS related plugins refactoring

2005-09-06 Thread Jérôme Charron
You should perhaps discuss such things before you 'commit' a new feature that already exists. I normally read most of the nutch mails. What was the date and subject? I may have overlooked this one. I don't know, it's Stefan's sentence, not mine, so please ask Stefan. Regards Jérôme --

Re: RSS Parser Bug!?

2005-09-08 Thread Jérôme Charron
But other than that, your analysis is correct, probably there should be an application/xml added to the list of handled content types. But this is further complicated by the fact, that Nutch doesn't do the right thing now if you have more than one plugin handling the same mime type... I have

Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-08 Thread Jérôme Charron
I may have some time to work on this over the next few days if no one else does. So, if you're taking the lead on this, I volunteer my help if you'd like it. Hi Chris, Thanks for your help. It seems that nobody starts working on this for now (I planned to do it in the next weeks). First of

Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Jérôme Charron
Jerome: Give me a shout if you need a hand on this. I'll be happy to help and as it happens, I'll be available in the next few weeks. Sébastien, Great! As I mentioned in my last comment on JIRA, please synchronize with Chris on this point. I'm currently coding on other subjects and don't have

Re: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-16 Thread Jérôme Charron
What about a default-plugin as Andrzej proposed. The default plugin mechanism is integrated in the parse-plugins descriptor using the * content-type It should behave like the unix-command strings. Does this make sense? Are you on it too? But we don't plan to develop it Otherwise, I

Re: Classpath for HTML Parser Plugin

2005-09-27 Thread Jérôme Charron
I noticed that HTML-Parser Plugin has references to xercesImpl.jar which is placed in src/plugin/parse-rss/lib/xercesImpl.jar Where do you find some references to xercesImpl.jar in HTML-Parser plugin? (If so, I don't understand how it can compile since the build scripts never import any lib

Re: java.net.MalformedURLException: no protocol for parse-plugins.xml

2005-10-03 Thread Jérôme Charron
Likely missing file:/. If I get rid of lines 617-622 of conf/nutch-default.xml Oops, sorry. I made this last change just after testing the whole patch, and I didn't test it again since I was sure it was a minor change. I will correct this right now. Sorry. Regards Jérôme --

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
I think would be neat to have the NutchAnalyzer also as a plugin, from my understanding right now if I want to analyze in a different way, I need to hack the nutch source code, if we are going to have different plugins for different analyzers that will help. Some specific application may use

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
I read about the MultiLingualSupport, but I didn't see it in the repository, I think is cool. The analyzer extension point is defined by the Analyzer abstract class: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java The default analyzer

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

2005-10-06 Thread Jérôme Charron
There is one potential problem that I see -- Nutch plugins require explicit JAR references. If you want to switch between algorithms you'll need to either put all Carrot2 JARs in the descriptor, put them in CLASSPATH before Nutch starts or do some other trickery with class loading. Only

Re: standard version of log4j

2005-11-07 Thread Jérôme Charron
hmmm.. so that means if we want to customize logging it would be for every plugin potentially? Perhaps a common logger would at least make some degree of sense. I really think it makes sense. When I fixed the issue about plugin dependencies, I began to create a log4j plugin in order to remove

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

Re: Lucene or Nutch

2005-11-10 Thread Jérôme Charron
I would be disappointed by this move - language identifier is an important component in Nutch. Now the mere fact that it's bundled with Nutch encourages its proper maintenance. If there is enough drive in terms of willingness and long-term commitment it would make sense to move it to a

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
Do we talk about parsing rdf or do we discuss to store parsed html text in rdf and convert it via xslt to pure text? I may misunderstand something. I very like the idea of a general rdf parser. Back in the days i played around with jena.sf.net Parsing yes, replace nutch sequence file and the

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion: For consistency, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file

Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Jérôme Charron
Sounds really good (and it is requested by a lot of nutch users!). +1 Jérôme On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote: Matt Kangas wrote: #2 should be a pluggable/hookable parameter. high-scoring sounds like a reasonable default basis for choosing recrawl intervals, but I'm sure

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
The total number of hits (approx) is 2,780,000,000. BTW, I find it curious that the last 3 or 6 digits always seem to be zeros ... there's some clever guesstimation involved here. The fact that Google Suggest is able to return results so quickly would support this suspicion. For more

Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi, I would like to remove all the hard-coded content-type checks spread over all the parse plugins. In fact, the content-type/plugin-id mapping is now centralized in the parse-plugins.xml file, and there is no longer any need for the parser to check the content-type. The basic idea was: 1. The

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1 A simple solution that provides a standard way to access common meta data. Great! -- http://motrech.free.fr/ http://www.frutch.org/

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88) http://issues.apache.org/jira/browse/NUTCH-112 http://issues.apache.org/jira/browse/NUTCH-135 http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev

Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
What people think if we collect a list of issues and make a voting iteration? +1

Re: vote results.

2005-12-15 Thread Jérôme Charron
Just continue voting I will continue with my tally sheet. :-) Why not create a wiki page... so that you don't have to do this tedious work. Jérôme

Re: Latest version of Mapred

2005-12-19 Thread Jérôme Charron
Thanks for the fast response, Do you know where I can find a compressed version? Here are the nightly builds: http://cvs.apache.org/dist/lucene/nutch/nightly/ Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Static initializers

2005-12-20 Thread Jérôme Charron
Andrzej, How do you choose the NutchConf to use ? Here is a short discussion I had with Doug about a kind of dynamic NutchConf inside the same JVM: ... By looking at the mailing lists archives it seems that having some behavior depending on the documents URL is a recurrent problem (for instance

Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values? Running many different tasks in parallel, each using different config, inside the same JVM. Ok, I understand this Andrzej, but it is not really what I call

Re: problems http-client

2006-01-06 Thread Jérôme Charron
A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on? Jérôme -- http://motrech.free.fr/

Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem too. I don't understand what happens. In fact, the CommandRunner returns a -1 exit code, but nothing in the error output and the good string in the standard output (nutch rocks nutch rocks nutch rocks). All seems to be ok but the exit code. Jérôme On 1/9/06, Piotr

Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Jérôme Charron
... in fact, not really... really unrelated !!! I remove it immediately. Thanks On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: --- lucene/nutch/trunk/src/plugin/build.xml (original) +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 @@ -6,13

Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Jérôme Charron
the following code would fail in case the meta tags are in upper case Node nameNode = attrs.getNamedItem(name); Node equivNode = attrs.getNamedItem(http-equiv); Node contentNode = attrs.getNamedItem(content); This code works well, because Nutch HTML Parser uses Xerces

Re: ParserFactory test fail

2006-01-10 Thread Jérôme Charron
Hi Stefan, No in fact, I have refactored the code of protocol-http plugins, not html parser. So, I don't think the log4 error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jérôme Charron
I am wondering whether the Analyzer of nutch in svn trunk is chosen by the languageidentifier plugin or not? (I know in nutch 0.7.1-dev it was). It's not really chosen by the languageidentifier, but chosen according to the value of the lang attribute (for now, that's right, only the languageidentifier adds this

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
Any plan to implement this? I mean move LanguageIdentifier class into nutch core. As I already suggested on this list, I really would like to move the LanguageIdentifier class (and profiles) to an independent Lucene sub-project (and the MimeType repository too). I don't remember why but

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
+1. Other local modifications which I use frequently: * exporting a list of supported languages, * exporting an NGramProfile of the analyzed text, * allow processing of chunks of input (i.e. LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is very useful if the text to

Re: xml-parser plugin contribution

2006-01-24 Thread Jérôme Charron
Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a new issue and attach the file. Perhaps you can use this already existing issue http://issues.apache.org/jira/browse/NUTCH-23 Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Jérôme Charron
Is it reasonable to guess language info. from target servers geographical info.? Yes, it could be another clue to guess language. But the problem is then to find how to use all these indices. For instance, the actual solution is the easiest one, but certainly not the more efficient one: For

Re: Cmd line for running plugins

2006-02-02 Thread Jérôme Charron
+1 On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote: +1 Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki: Hi, I just found out that it's not possible to invoke main() methods of plugins through the bin/nutch script. Sometimes it's useful for testing and debugging - I can do it

javaswf.jar

2006-02-06 Thread Jérôme Charron
Hi, It seems that the javaswf.jar lib was built using jdk 1.5: class file has wrong version 49.0, should be 48.0 Did I miss something, or should Nutch still be compiled using jdk 1.4.x? Please confirm, so that I can commit a new javaswf.jar built with jdk 1.4 Regards Jérôme --

Empty Parse

2006-02-09 Thread Jérôme Charron
Hi all, I just noticed an inconsistency when there is a parsing failure: 1. The Fetcher returns an empty ParseImpl instance (it contains no metadata, especially SEGMENT_NAME_KEY and SIGNATURE_KEY) 2. Then, the Indexer tries to add the fields segment and digest from the metadata keys

Jakarta-POI 3.0-alpha1

2006-02-09 Thread Jérôme Charron
Hi, I have made some experiments with the 3.0-alpha1 version of Jakarta POI (used by parse-msword and parse-mspowerpoint). Since this version contains the hwpf package it enables to parse msword documents too (the actual version in lib-jakarta-poi plugin doesn't contain this package). The benefit

Re: Empty Parse

2006-02-09 Thread Jérôme Charron
Is this happening with the latest code? Yes. But by looking in the svn repository . it is my fault ... sorry (NUTCH-139) I fix that right now. Thanks Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: duplicate libs

2006-02-14 Thread Jérôme Charron
There are a number of duplicated libs in the plugins, namely: Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196? I have already provided a patch for a log4j lib. If there is no objection, I will commit it and go ahead for * lib-commons-httpclient * lib-nekohtml Jérôme

Re: duplicate libs

2006-02-15 Thread Jérôme Charron
Yes, there is an easier way. Implement a custom task to which you'll pass a path to plugin.xml and a name for a path. The task (Java code) will create a named (id) path object which can be subsequently used in ant with classpath refid=xxx /. This requires a custom ant task, but as you

Re: duplicate libs

2006-02-15 Thread Jérôme Charron
may you will find that interesting also: http://maven.apache.org/using/multiproject.html Thanks Stefan. Maven seems to be a really good software project management tool. But for now, I don't plan to migrate to maven... (I don't have enough knowledge about it and so I don't have a good overview

Re: duplicate libs

2006-02-16 Thread Jérôme Charron
Sounds very good! I may missed - that are you able to extract the dependencies from the plugin.xml without hacking ant? Yes, by using the xmlproperty task: it defines a property for each path found in the xml document ( http://ant.apache.org/manual/CoreTasks/xmlproperty.html ) Jérôme --

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html org.apache.nutch.parse.pdf (Nutch 0.7.1 API) but I don't see it in the source of 0.7.1 downloaded I see it on cvs here: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s

Re: Nutch Parsing PDFs, and general PDF extraction

2006-02-28 Thread Jérôme Charron
Putting the well-formed version of the plugin code you provided generated the following exception: Is the nutch-extensionpoints plugin activated?

Re: Nutch Parsing PDFs, and general PDF extraction

2006-03-02 Thread Jérôme Charron
This is something google does very well, and something nutch must match to compete. Richard, it seems you are a real pdf guru, so any code contribution to nutch is welcome. ;-) Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: PDF Parse Error

2006-03-02 Thread Jérôme Charron
Yes, but please do not cross-post - many of us are subscribed to both groups, and we're getting multiple copies of your posts... +1 I agree, this is inconsistent and should be changed. I think all places should use -1 as a magic value, because it's obviously invalid. +1 Richard, could you

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-02 Thread Jérôme Charron
Calling compile-core for every plugin makes builds really slow. I was surprised that nobody complained about this... ;-) I think it's safe to assume that the core has already been compiled before plugins are compiled. Don't you? It just ensures that the last modified core version is

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
In a distributed configuration one needs to rebuild the job jar each time anything changes, and hence must check all plugins, etc. So I would appreciate it if this didn't take quite so long. Makes sense! Here is my proposal. For each plugin: * Define a target containing core (will be used when

Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Jérôme Charron
Adding DOAP for Nutch. Contributed by Chris Mattmann. Added: lucene/nutch/trunk/site/doap.rdf Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java

Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote: Jérôme Charron wrote: Here is my proposal. For each plugin: * Define a target containing core (will be used when building single plugin) * Define a target not containing core (will be used when building whole code) I commit this as soon

Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration

2006-03-09 Thread Jérôme Charron
In fact, my first need was to be able to configure the boost for RawFieldQueryFilter. The idea is then to give to the user a better control of boost values by simply : * add a setBoost(float) method to RawFieldQueryFilter. * (add a setLowerCase(boolean) method to RawFieldQueryFilter) * Add some

Re: quality of search text

2006-03-10 Thread Jérôme Charron
I think algorithm #1 is what google uses. google ignores content that does not change from page to page, as well as content that isn't part of a block of text. Are you sure? Take a look at these search results:

AnalyzerFactory

2006-03-10 Thread Jérôme Charron
It seems that the usage of AnalyzerFactory was removed while porting Indexer to map/reduce. (AnalyzerFactory is no more called in trunk code) Is it intentional? (if no, I have a patch that I can commit, so thanks to confirm) Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
It's not only faster, it also scales better for large and complex expressions, it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter. It seems awesome! Does somebody plan to switch to this lib in nutch? Does

Re: Much faster RegExp lib needed in nutch?

2006-03-12 Thread Jérôme Charron
Thanks for volunteering, you're welcome ... ;-) Good job Andrzej !;-) So, That's now in my todo list to check the perl5 compatibility issue and to provide some benchs to the community... Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Jérôme Charron
I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. Fixed (r385702). Thanks Chris. NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-) It is the

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
Besides that, we should perhaps add a kind of timeout to the url filter in general, since it can happen that a user configures a regex for his nutch setup that runs into the same problem as we have right now. Something like the attachment below. Would you agree? I can create a serious patch and test it

Re: Much faster RegExp lib needed in nutch?

2006-03-16 Thread Jérôme Charron
If it were easy to implement all java regex features in dk.brics.automaton.RegExp, then they probably would have. Alternately, if they'd implemented all java regex features, it probably wouldn't be so fast. So I worry that attempts to translate are doomed. Better to accept the differences:

Re: Spelling suggestion for RSS Feed

2006-03-28 Thread Jérôme Charron
I've implemented the spelling correction for the RSS Opensearch feed, hopefully in keeping with the opensearch guidelines. If this format is ok, I'll submit an optional patch alongside the current one at http://issues.apache.org/jira/browse/NUTCH-48 . +1 Jérôme -- http://motrech.free.fr/

Re: Refactoring some plugins

2006-03-29 Thread Jérôme Charron
I don't think it upside down. Plugins should not share packages with core code, since that would permit them to use package-private APIs. Also, re-arranging the code to make the javadoc nice is right, since the javadoc is a primary means of describing the code. Yes, but what I mean is that

Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
I'm reluctant to move the extension interface away from the parameter and return value classes used by that interface. I'm reluctant too... I asked, in case someone has a magic idea... Could we instead add a super-interface that all extension-point interfaces extend? That way all of the

Re: Refactoring some plugins

2006-03-31 Thread Jérôme Charron
No, I don't think so. These are strongly related bundles of plugins. When you change one chances are good you'll change the others, so it makes sense to keep their code together rather than split it up. Folks can still find all implementations of an interface in the javadoc, just not always

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Jérôme Charron
PMD looks like a useful such tool: http://pmd.sourceforge.net/ant-task.html I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers? +1 (Very configurable, very good tool!)

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-06 Thread Jérôme Charron
With code coverage... I don't know. It's up to you guys -- you spend much more time on Nutch code than I do and you know best what is needed and what isn't. My feeling was simply that the closer we are to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No? Jérôme

Re: PMD integration

2006-04-07 Thread Jérôme Charron
that right now it is checking only main code (without plugins?). Yes, that's correct -- I forgot to mention that. PMD target is hooked up with tests and stops the build if something fails. I thought the core code should be this strict; for plugins we can have more relaxed rules -1 Since

Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-07 Thread Jérôme Charron
My feeling was simply that the closer we are to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No? I absolutely agree Jérôme, really. It's just that developers usually tend to hook up dozens of QA plugins and never look at what they output (that's the usual

Re: PMD integration

2006-04-07 Thread Jérôme Charron
I will make it totally separate target (so test do not depend on it). +1 The goal is to allow other developers to play with pmd easily but at the same time I do not want the build to be affected. +1 I would like also to look at possibility to generate crossreferenced HTML code from

Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Jérôme Charron
Do you guys have any additional insights / suggestions whether NUTCH-240 and/or NUTCH-61 should be included in this release? NUTCH-240 : I really like the idea, but for now, I agree that its API is still ugly. I would like to help in the next weeks... So for me it should not be included in

Content-Type inconsistency?

2006-04-10 Thread Jérôme Charron
It seems there is an inconsistency with content-type handling in Nutch: 1. The protocol level content-type header is added in content's metadata. 2. The content-type is then checked/guessed while instantiating the Content object and stored in a private field (at this step, the Content object can

Re: [Proposal] New Lucene sub-project

2006-04-10 Thread Jérôme Charron
I found your idea very interesting. I will be interested to contribute to the Parse Plugins Framework. I have developed similar one using Lucene. The project name is Lius. Hi Rida, Yes, I know Lius. It seems very interesting, and I think it would be very interesting too if we can merge our

Re: PMD integration

2006-04-11 Thread Jérôme Charron
Piotr, please keep oro-2.0.8 in pmd-ext I do not agree here - we are going to make a new release next week and releasing with two versions of oro does not look nice. oro is quite stable product and changes are in fact minimal: http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES OK for

Re: Content-Type inconsistency?

2006-04-13 Thread Jérôme Charron
I would like to come back on this issue: The Content object holds two content-types: 1. The raw content-type from the protocol layer (http header in case of http) in the Content's metadata 2. The guessed content-type in a private field content-type. When a ParseData object is created, it takes

Nutch calendar

2006-04-14 Thread Jérôme Charron
Hi all, Just for fun, I have created a public nutch calendar on Google Calendar. You can add it to your Google calendars or access it via these URLs: Feed URL is : http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic ICAL URL is : http://www.google.com/calendar/ical/[EMAIL

Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom

2006-04-26 Thread Jérôme Charron
Is this just needed for references from javadoc? If so, then this can be copied to build/docs, no? Yes. Committed. Jérôme

Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml

2006-04-27 Thread Jérôme Charron
parse-oo plugin manifest is valid with plugin.dtd Oops, I didn't catch that... Thanks! No problem Andrzej. It is just a cosmetic change since the plugin.xml are not validated at runtime (it is in my todo list), and the contentType and pathSuffix parameters are more or less deprecated. Jérôme

Re: Content-Type inconsistency?

2006-04-27 Thread Jérôme Charron
Are you mainly concerned with charset in Content-Type? Not specifically. But while looking at these content-type inconsistencies, I noticed that there are some possible troubles with charset in content-type. Currently, what happens when Content-Type exists in both HTTP layer and in META tag
