Re: End-Of-Life status for 0.7.x?
+1

On Jan 18, 2008 5:22 AM, Sami Siren [EMAIL PROTECTED] wrote:
> Andrzej Bialecki wrote:
>> Hi all,
>> My opinion is that we should mark it EOL, and close all JIRA issues that are relevant only to 0.7.x with the status Won't Fix.
> +1
> --
> Sami Siren

--
Jérôme Charron
Technical Director @ WebPulse
Tel: +33673716743 - [EMAIL PROTECTED]
http://blog.shopreflex.com/
All tastes are found in nature; yours are on http://www.shopreflex.com
Re: log guards
> These guards were all introduced by a patch some time ago. I complained at the time and it was promised that this would be repaired, but it has not yet been.

Yes. Sorry Doug, that's my own fault: I really don't have time to fix this :-(

Best regards
Jérôme
Re: log guards
Hi Chris,

The JIRA issue is NUTCH-309: https://issues.apache.org/jira/browse/NUTCH-309

Thanks for your help.
Jérôme

On 2/13/07, Chris Mattmann [EMAIL PROTECTED] wrote:
> Hi Doug and Jerome,
> Ah, yes, the log guard conversation. I remember this from a while back. Hmmm, do you guys know which issue this was recorded as in JIRA? I have some free time recently, so I will be able to add this to my list of Nutch stuff to work on, and would be happy to take the lead on removing the guards where needed, and reviewing whether or not the debug ones make sense where they are.
> Cheers,
> Chris
>
> On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote:
>>> These guards were all introduced by a patch some time ago. I complained at the time and it was promised that this would be repaired, but it has not yet been.
>> Yes. Sorry Doug, that's my own fault: I really don't have time to fix this :-(
>> Best regards
>> Jérôme
>
> Chris A. Mattmann [EMAIL PROTECTED]
> Staff Member, Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
> Jet Propulsion Laboratory, Pasadena, CA
> Office: 171-266B  Mailstop: 171-246
> Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: implement thai language indexing and search
> i used an existing ThaiAnalyzer which was in the lucene package. ok - i renamed lucene.analysis.th.* to nutch.analysis.th.* - compiled and placed all class files in a jar - analysis-th.jar (do i need to bundle the ngp file in the jar as well?)

1. You don't have to refactor the lucene analyzer. Just wrap it, like I do with the french and german analyzers (they both use analyzers from lucene).
2. An analyzer doesn't need NGP files... I think you misunderstood something:
2.1 On one side there is the language identifier, which uses NGP files to identify the language of a document.
2.2 On the other side, if a suitable analyzer is found for the identified language, it is used to analyze the document.

Regards
Jérôme
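For illustration, a wrapper along these lines should be enough (a sketch modeled on how analysis-fr wraps Lucene's FrenchAnalyzer; the exact NutchAnalyzer signatures may differ across Nutch versions, so check against your tree):

    package org.apache.nutch.analysis.th;

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.nutch.analysis.NutchAnalyzer;

    // Wraps Lucene's ThaiAnalyzer as a Nutch analysis plugin,
    // without copying or renaming any Lucene code.
    public class ThaiAnalyzer extends NutchAnalyzer {

      private static final Analyzer ANALYZER =
          new org.apache.lucene.analysis.th.ThaiAnalyzer();

      // Delegate all tokenization to the wrapped Lucene analyzer.
      public TokenStream tokenStream(String field, Reader reader) {
        return ANALYZER.tokenStream(field, reader);
      }
    }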
Re: implement thai language indexing and search
> ok. I was able to enable the language identifier plugin by adding the value to the plugin.includes attribute in nutch-site.xml - but i'm not sure that just by doing that I can have thai text recognized and tokenized properly. What else do I have to do? Please help me.

1. You must create a Thai NGP (n-gram profile file) so that the language identifier can identify Thai!
2. You must create a Thai analyzer (see for instance the analysis-fr and analysis-de sample analyzers).

Best Regards
Jérôme
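To make the profile idea concrete, here is a self-contained sketch of what an n-gram profile captures (plain Java, not the actual Nutch NGramProfile API): count character n-grams over a training corpus; at identification time the profile closest to the document's own n-gram distribution wins.

    import java.util.HashMap;
    import java.util.Map;

    // Builds a character trigram frequency table from sample text.
    // A real profile is built from a large corpus and truncated to
    // the top few hundred n-grams before being written to a .ngp file.
    public class TrigramProfile {
      public static Map<String, Integer> build(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String s = " " + text.toLowerCase() + " ";
        for (int i = 0; i + 3 <= s.length(); i++) {
          String gram = s.substring(i, i + 3);
          Integer c = counts.get(gram);
          counts.put(gram, c == null ? 1 : c + 1);
        }
        return counts;
      }
    }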
Re: Content-type detection for Tika
> I'm thinking about implementing the (draft) shared MIME database spec [1] from freedesktop.org in Tika as a modern MIME magic implementation to help automatically detect and handle the types of resources where insufficient typing metadata is available. The specified typing information also includes an inheritance model which allows for automatic failover to more generic parsers (e.g. from image/svg to text/xml) when specific parser plugins are not available.

I already have such code for Nutch (freedesktop-based content-type detection). These days I no longer have time to spend on Nutch, but I can send you the code. Please contact me on my private mail.

Regards
Jérôme
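The inheritance-based failover is simple to sketch: walk a sub-type to super-type table until a registered parser is found. The table and method names below are hypothetical, just to illustrate the mechanism:

    import java.util.HashMap;
    import java.util.Map;

    public class MimeFallback {
      // sub-type -> super-type relations, as declared in the
      // freedesktop shared MIME database.
      static Map<String, String> superType = new HashMap<String, String>();
      static {
        superType.put("image/svg+xml", "text/xml");
        superType.put("text/xml", "text/plain");
      }

      // Returns the most specific type for which a parser is registered,
      // falling back to ever more generic super-types.
      static String resolve(String type, Map<String, Object> parsers) {
        while (type != null && !parsers.containsKey(type)) {
          type = superType.get(type);
        }
        return type; // null if no parser matches at all
      }
    }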
Re: Antwort: Re: parse-plugins.xml
> What you probably mean is something equivalent to Unix strings(1). I have a plugin that implements this, which I could contribute if there's interest.

+1

Jérôme
Re: Error with Hadoop-0.4.0
> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.

Thanks for this feedback Stefan. We should fix that. What I suggest is simply to remove line 75 in the createJob method of CrawlDb:

    setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed neither by Injector.inject() nor by CrawlDb.update(). If there is no objection, I will commit this change tomorrow.

Regards
Jérôme
Error with Hadoop-0.4.0
Hi,

I encountered some problems with the Nutch trunk version. It seems to be related to the Hadoop-0.4.0 and JDK 1.5 changes (more precisely since HADOOP-129 and the replacement of File by Path). In my environment, the crawl command terminates with the following error:

    2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
    Exception in thread main java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
      at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
      at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

By looking at the Nutch code, and simply changing line 145 of Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), all is working fine. By taking a closer look at the CrawlDb code, I finally don't understand why the createJob method contains the following line:

    job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

Out of curiosity, perhaps a hadoop guru can explain why there is such a regression... Does somebody else have the same error?

Regards
Jérôme
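The distinction between the two calls is the whole bug; here is a toy illustration of the semantics (plain Java standing in for the Hadoop JobConf methods, not the Hadoop API itself):

    import java.util.ArrayList;
    import java.util.List;

    // Mimics the difference between JobConf.setInputPath (replace)
    // and JobConf.addInputPath (append) that triggers the error above.
    public class InputPathSemantics {
      static List<String> inputs = new ArrayList<String>();

      static void setInputPath(String p) { inputs.clear(); inputs.add(p); }
      static void addInputPath(String p) { inputs.add(p); }

      public static void main(String[] args) {
        addInputPath("crawldb/current");  // done earlier, in createJob()
        addInputPath("tempDir");          // Injector.inject() before the fix
        // crawldb/current is still in the input set, and may not exist yet:
        System.out.println(inputs);       // [crawldb/current, tempDir]

        setInputPath("tempDir");          // the fix: replace, don't append
        System.out.println(inputs);       // [tempDir]
      }
    }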
Re: Possible memory leak?
It seems to be a side effect of NUTCH-169 (remove static NutchConf). Prior to this, the language identifier was a singleton. I think we should cache its instance in the conf, as we do for many other objects in Nutch. Enrico, could you please create a JIRA issue?

Thanks
Jérôme
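The caching pattern referred to is roughly the following (a sketch: the conf object of that era exposed getObject/setObject accessors, but check the exact names in your version; the key is just an illustrative choice):

    import org.apache.hadoop.conf.Configuration;

    public class LanguageIdentifierCache {
      // Cache key: a unique string, typically the class name.
      private static final String KEY =
          "org.apache.nutch.analysis.lang.LanguageIdentifier";

      // Look up the per-configuration singleton, creating it once.
      public static Object getInstance(Configuration conf) {
        Object instance = conf.getObject(KEY);
        if (instance == null) {
          instance = new Object(); // stand-in for new LanguageIdentifier(conf)
          conf.setObject(KEY, instance);
        }
        return instance;
      }
    }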
Re: noindex / do not index
> as far as I can see, nutch's html parser only supports the noindex meta tag (<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">), but there is an unofficial html noindex tag. http://www.webmasterworld.com/forum10003/2703.htm

Hello Stefan,

Here is a previous discussion about this: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg04576.html

> Maybe this would be another thing to make nutch more polite.

+1

Jérôme
Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach
> I don't think guards should be added everywhere.

That's right Doug. It was a rough first pass on logging. The next, finer pass will be done with NUTCH-310.

> Rather, guards should only be added in performance critical code, and then only for Debug-level output. Info and Warn levels are normally enabled, and developers should thus not log messages at these levels so frequently that performance will be compromised.

Yes, but that's actually not the case in Nutch: most logging statements use the Info level.

> And not all Debug-level log statements need guards, only those that are in inner loops, where the construction of the log message may significantly affect performance.

I plan to review all the logging statements while working on NUTCH-310, and I will then follow your directions.

Thanks
Jérôme
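For reference, a guard in Commons Logging looks like this; the point is that the string concatenation only happens when debug output is actually enabled:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class GuardExample {
      private static final Log LOG = LogFactory.getLog(GuardExample.class);

      void fetch(String url, int status) {
        // Guard: skip building the message unless debug is on.
        // Worth it only in tight loops; at info level the message
        // should be cheap and infrequent enough to log directly.
        if (LOG.isDebugEnabled()) {
          LOG.debug("fetched " + url + " status=" + status);
        }
        LOG.info("fetch of " + url + " complete");
      }
    }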
Re: [Nutch-cvs] svn commit: r414681 - /lucene/nutch/trunk/src/java/org/apache/nutch/protocol/ProtocolFactory.java
> I'm somewhat worried about the possible clash in the conf name-space - usually, when we store Objects in a Configuration instance, we use their full class name, or at least a long and most probably unique string. In this case, we use just http, https, ftp, file and so on... Would it make sense if in this special case we used X_POINT + protocolName as the unique string?

+1 (why not use the extension id directly?)
Re: [jira] Resolved: (NUTCH-303) logging improvements
> There seem to be two log4j.properties files in the generated war; is this intentional?

Not intentional. A side effect. In fact, the first one is the one that comes from the conf dir (I will exclude it from the war so that things are clearer). The second one (which overrides the first) is the right one, coming from the web directory.

> However it works just fine.

That's good news. Sami, I have not made changes to web2. Do you want me to switch web2 to Commons Logging?

Regards
Jérôme
Nutch logging questions
Hi,

I'm currently working on NUTCH-303 so that nutch uses the commons logging facade API with log4j as the default implementation. All the code is now switched to the Commons Logging API, and I have replaced some System.out and printStackTrace calls with Commons Logging. To finalize this patch, my remaining questions are about configuration:

1. Should the back-end and front-end have the same logging configuration?
2. What kind of configuration do you think is the best default? For now, I have used the same log4j properties as hadoop (see http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markup&pathrev=411254) for the back-end, and I was thinking of using stdout for the front-end. What do you think about this?
3. When using the default DRFA appender (Daily Rolling File Appender) in nutch, should I log into the hadoop log file or into a nutch-specific file?

Thanks for your feedback.
Jérôme
Re: Status of language plugin
> Is there an API doc or design doc that I can read to understand where you are? Is the language plugin architecture already in the main trunk?

The only available document is http://wiki.apache.org/nutch/MultiLingualSupport and sometimes I maintain this page: http://wiki.apache.org/nutch/JeromeCharron

> Here are some issues that I've been worried about:
> * Support of a multilingual plugin? If one plugin can support more than one language, the language needs to be passed at each analysis.

I don't fully understand your need. But if you have an analysis plugin that can handle many languages, you can simply define many implementations in your plugin.xml, i.e.:

    <extension id="org.apache.nutch.analysis.cjk"
               name="CJKAnalyzer"
               point="org.apache.nutch.analysis.NutchAnalyzer">
      <implementation id="org.apache.nutch.analysis.cn.ChineseAnalyzer"
                      class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
        <parameter name="lang" value="cn"/>
      </implementation>
      <implementation id="org.apache.nutch.analysis.kr.KoreanAnalyzer"
                      class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
        <parameter name="lang" value="kr"/>
      </implementation>
      <implementation id="org.apache.nutch.analysis.jp.JapaneseAnalyzer"
                      class="org.apache.nutch.analysis.cjk.CJKAnalyzer">
        <parameter name="lang" value="jp"/>
      </implementation>
    </extension>

> * This assumes language identification is done before analysis. Is that the case?

Yes.

> * Support of a different analyzer for query than for index: an analyzer for queries may need to behave differently than an analyzer for indexing. Can your architecture specify different analyzers for indexing and query?

In fact, to avoid adding a QueryAnalyzer extension point, the Query uses the same Analyzer implementation as the one used for document analysis.

Jérôme
Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar
> URL: http://svn.apache.org/viewvc?rev=411943&view=rev
> Log: Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons Logging, configured for log4j by default.

If log4j is now included in the core, we can remove the lib-log4j plugin. If there is no objection, I will do it.

Jérôme
Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar
> As far as I understand, hadoop uses commons logging. Should we switch to commons logging as well?

Why not... (but using commons logging doesn't preclude having a default implementation, such as the log4j one used by hadoop).
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> You're right -- changing anything with the input (snippet length, number of documents, etc.) will alter the clusters. This is basically how it works. If you want clustering in your search engine then, depending on the type of data you serve, you'll have to experiment with the settings a bit and see which give you satisfactory results. I don't think there is any particular reason to provide different data to the clusterer. Moreover, it'd complicate things quite badly.

Thanks Dawid for your response. In fact, I don't really want to change this; I just wanted to be sure that everybody is aware of it and to gather some opinions.

Regards
Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> Add 3. Clustering would benefit from a plain text version.

Yes Dawid, but it is already committed: the clustering now uses the plain text version returned by the toString() method.

Dawid, I have a question about clustering. Currently, the clustering uses the summaries as input. I assume it would provide better results if it took the whole document content, no? I assume the clustering uses the summaries instead of the document content for performance reasons. But there is a (bad) side effect: since the size of the summaries is configurable, the clustering quality will vary depending on the summary-size configuration. I find this very confusing: when folks adjust this parameter, it is only for front-end considerations (they want to display a long or a short summary), and certainly not for clustering reasons. What do you and others think about this?

Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> (but if the nutch-site.xml overrides the plugin.includes property and doesn't include it, it will not be activated, like any other plugin)
> yes, that's what I meant; I guess that's the default case for people hacking plugins.

Oh, yes Sami, I understand what you mean... Sorry, I just forgot to mention this point on the list (so, plugin hackers, you need to add one of the new summary plugins if you want summaries displayed). Sorry, I also forgot to add the summary plugins to the default webapp context file (nutch.xml)... I will add this once svn write access is available again. And one more time sorry, because I also forgot to carry the summary API changes over to the web2 module...

Regards
Jérôme
Re: [Nutch-dev] Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> Bob Carpenter of alias-i had this to say when I brought up this very idea: http://article.gmane.org/gmane.comp.jakarta.lucene.devel/12599

Thanks for your response Marvin. But finally my question is: shouldn't the nutch clustering use fixed-size snippets instead of the configurable display size?

Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
> This means there's no markup in the OpenSearch output?

Yes, no markup for now.

> Shouldn't there be?

The restriction on the description field is: "Can contain simple escaped HTML markup, such as <b>, <i>, <a>, and <img> elements." So, yes, why not. We can add <b> around highlights. What do you and others think?

> Perhaps this should be a method on Summary, to render it as html?

I had some hesitations about this while coding. In fact, as suggested in the issue's comments, I would like to add a generic method on Summary, String toString(Encoder, Formatter), like in Lucene's Highlighter, and provide some basic implementations of Encoder and Formatter.

Jérôme
Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/
>> String toString(Encoder, Formatter) like in Lucene's Highlighter, and provide some basic implementations of Encoder and Formatter.
> That sounds fine, but in the meantime, let's not reproduce the html-specific code in lots of places. We need it in both search.jsp and in OpenSearchServlet.java. So we should have it in a common place. A method on Summary seems like a good place. If we subsequently add a more general API then we could re-implement the toHtml() method using that API, but I think a generic toHtml() method will be useful for quite a while yet.

Yes Doug, but in fact the idea is to add the toString(Formatter) method in a common place (Summary), and to add one specific Formatter implementation for OpenSearch and another one for search.jsp. The reason is that they should not use the same HTML code:
1. OpenSearch should only use <b> around highlights.
2. search.jsp should use more complicated HTML code (<span> ...).

In fact, I don't know if the Formatter solution is the right one, but toString() or toHtml() must be parameterized, since the two pieces of code that use this method need distinct outputs.

Jérôme
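A sketch of the parameterized rendering idea (the Formatter interface here is hypothetical, merely shaped like the one in Lucene's Highlighter):

    // Hypothetical minimal formatter: decides how a highlighted
    // fragment of a summary is rendered.
    interface Formatter {
      String highlight(String text);
    }

    // For OpenSearch output: only simple <b> markup.
    class OpenSearchFormatter implements Formatter {
      public String highlight(String text) {
        return "<b>" + text + "</b>";
      }
    }

    // For search.jsp: richer markup, e.g. a styled span.
    class JspFormatter implements Formatter {
      public String highlight(String text) {
        return "<span class=\"highlight\">" + text + "</span>";
      }
    }

Summary.toString(Formatter) would then walk its fragments, passing each highlighted fragment through the supplied formatter, so each caller gets its own output without duplicating the rendering loop.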
Re: http chunked content
> As far as I know, a lot of http servers respond with chunked content - at least all that return dynamically generated pages. Should I file a bug? Any thoughts?

In fact, the requests issued from the http plugin are HTTP 1.0, so servers should never return chunked content. I think readChunkedContent was included in the code for future use.

Regards
Jérôme
Re: Feature idea - Indexing Text Lengths
> Sorry I can't give more than an idea - I'm not a java developer - but I think the idea could prove useful. The idea is to limit the length of sentences that get entered into the index. So, after parsing a page, any words that don't form what appears to be a complete sentence get ignored.

Douglas,

Here is a previous discussion about this subject on the list: http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03070.html

Take a look at this thread... this problem is not as easy as it seems.

Regards
Jérôme
Re: Content-Type inconsistency?
> I'm not so sure. When crawling Apache we had trouble with this feature. Some HTML files that had an XML header and that the server identified as text/html, Nutch decided to treat as XML, not HTML.

Yes, the current version of the mime-type resolver is a crude one. XML, HTML, RSS and all XML-based files are not always correctly identified. (This problem is well known, and causes trouble for instance with RSS feeds that return a text/xml content-type.)

> We had to turn off the guessing of content types to index Apache correctly.

Instead of turning off the guessing of content types, you only needed to remove the magic for xml in mime-types.xml. In the new version (based on freedesktop) that has been sleeping for a while on my disk, I think such problems are solved, since it introduces information not included in the current version: a hierarchy between content-types (text/html is a subclass of text/xml), a way to express complex magic clauses, and so on. For instance, it can correctly identify RSS documents: generally RSS feeds are served with a generic text/xml content-type, and we cannot identify them, so they fall back to the generic parse-text parser.

> I think we shouldn't aim to guess things any more than a browser does. If browsers require standards compliance, then our lives will be simpler.

Yes, but Nutch actually cannot act as a browser. Take RSS, for instance: a browser knows that a URL is an RSS feed because there is a <link rel="alternate" type="..."/> element with the correct content-type (application/rss+xml) in the referring HTML page. Nutch doesn't keep such information for guessing a content-type (it could be a good thing to add), so it must find the content-type from the URL alone, without any context. Since most servers simply return the generic text/xml content-type, the only way to know a document is RSS-related is to use magic content-type guessing (note that many browsers don't identify it as an RSS document either, but simply as a generic xml file). One more thing: there is currently no officially registered content-type for RSS, so guessing from the document content is the only way to know it is an RSS document.

Jérôme
Re: [Nutch-cvs] svn commit: r397320 - /lucene/nutch/trunk/src/plugin/parse-oo/plugin.xml
>> parse-oo plugin manifest is valid with plugin.dtd
> Oops, I didn't catch that... Thanks!

No problem Andrzej. It is just a cosmetic change, since the plugin.xml files are not validated at runtime (that is on my todo list), and the contentType and pathSuffix parameters are more or less deprecated.

Jérôme
Re: Content-Type inconsistency?
> Are you mainly concerned with charset in Content-Type?

Not specifically. But while looking at this content-type inconsistency, I noticed that there is some possible trouble with charset in content-type too.

> Currently, what happens when Content-Type exists in both the HTTP layer and in a META tag (if the content is HTML)?

We cannot use the one in the meta tags: to extract it, we would first need to know that the html parser should be used. Only the HTTP header is used. It is then checked/guessed using the mime-type repository (a mime-type database that contains each mime-type with its associated file extensions and optionally some magic bytes).

> How does Nutch guess the Content-Type, and when does it need to do that?

See my response above.

> Is there a situation where the guessed content-type differs from the content-type in the metadata?

From the one in the headers: yes (mainly when the server is badly configured). Here is an easy way to reproduce what I mean by content-type inconsistency:
1. Perform a crawl of the following URL: http://jerome.charron.free.fr/nutch/fake.zip (fake.zip is a fake zip file; in fact it is an html one).
2. While crawling, you can see that the content-type returned by the server is application/zip.
3. But you can see that Nutch correctly guesses the content-type as text/html (it uses the HtmlParser).
4. At this step, all is ok.
5. Then start your tomcat and try the following search: zip.
6. You can see the fake.zip file in the results. Click on "details"; if the index-more plugin was activated, you can see that the stored content-type is application/zip and not text/html.

What I suggest is simply to use the content-type that nutch used to select the parser (the guessed one) instead of the one returned by the server.

Jérôme
Re: Content-Type inconsistency?
> I'm not sure if that is the right thing. If the site administrator did a poor job and a wrong media type is advertised, it's the site's problem and Nutch shouldn't be fixing it, in my opinion. Those sites would not work properly with browsers anyway, and Nutch doesn't need to work properly with them either, except that it should protect itself from crashing. I tried to visit your fake.zip page with IE and Firefox, and both faithfully trusted the media type as advertised by the server, and asked me if I wanted to open it with WinZip or save it; there was no option to open it as HTML. Why should Nutch treat it as HTML?

Simply because it is an HTML file - with a strange name, of course, but it is an HTML file. My example is a kind of caricature; a more realistic case would be an HTML file served with a text/plain content-type, or with a text/xml one. Finally, isn't it good news that Nutch seems to be smarter about content-type guessing than Firefox or IE?

Jérôme
Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom
> Is this just needed for references from javadoc? If so, then this can be copied to build/docs, no?

Yes. Committed.

Jérôme
Nutch calendar
Hi all,

Just for fun, I have created a public nutch calendar on Google Calendar. You can add it to your Google calendars or access it via these URLs:

Feed URL: http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic
ICAL URL: http://www.google.com/calendar/ical/[EMAIL PROTECTED]/public/basic

Anybody is welcome to edit this calendar. Just contact me so that I can add you to the list of editors.

Regards
Jérôme
Re: Content-Type inconsistency?
I would like to come back to this issue. The Content object holds two content-types:
1. the raw content-type from the protocol layer (the http header in the case of http), stored in the Content's metadata;
2. the guessed content-type, stored in a private content-type field.

When a ParseData object is created, it takes only the Content's metadata. So the ParseData can access only the raw content-type, not the guessed one. What I suggest is one of:
1. Add a content-type parameter to the ParseData constructors (so that parsers can pass the guessed content-type to ParseData).
2. Have the Content object store the guessed content-type in its metadata under a special attribute named, for instance, GUESSED_CONTENT_TYPE, so that the ParseData can access it.

I think 1 is really the cleanest way to implement this, but a lot of code is impacted (all the parsers). Solution 2 has no impact on APIs, so the code changes are very small.

Suggestions? Comments?

Jérôme
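Solution 2 would look roughly like this (a sketch with a hypothetical key name, using java.util.Properties as a stand-in for the Content metadata object):

    import java.util.Properties;

    public class GuessedTypeSketch {
      // Hypothetical metadata key for solution 2.
      public static final String GUESSED_CONTENT_TYPE = "Guessed-Content-Type";

      public static void main(String[] args) {
        Properties metadata = new Properties(); // stand-in for Content metadata
        metadata.setProperty("Content-Type", "application/zip");  // raw, from HTTP
        metadata.setProperty(GUESSED_CONTENT_TYPE, "text/html");  // guessed

        // ParseData (and index-more) would then prefer the guessed value,
        // falling back to the raw header when no guess was stored:
        String type = metadata.getProperty(GUESSED_CONTENT_TYPE,
                                           metadata.getProperty("Content-Type"));
        System.out.println(type); // text/html
      }
    }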
Re: PMD integration
>> Piotr, please keep oro-2.0.8 in pmd-ext
> I do not agree here - we are going to make a new release next week and releasing with two versions of oro does not look nice. oro is a quite stable product and the changes are in fact minimal: http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES

OK for me. But we cannot make a release without minimal tests. (I will make some tests for removing oro from nutch's regex plugins after the 0.8 release.)

Jérôme
Content-Type inconsistency?
It seems there is an inconsistency in content-type handling in Nutch:
1. The protocol-level content-type header is added to the content's metadata.
2. The content-type is then checked/guessed while instantiating the Content object and stored in a private field (at this step, the Content object can have 2 different content-types).
3. The Content's private content-type field is used to find the right parser.
4. Once the Parse object is constructed, the Content is no longer used (so the guessed content-type is lost).
5. The index-more plugin therefore indexes the raw content-type, not the guessed one.
6. As a consequence, the content-type displayed in more.jsp is the raw one, and the one used to query on type is the raw one too.

Wouldn't it be better to use the guessed content-type all along the process? (Except in cache.jsp, where the raw one should be used.)

Jérôme
Re: [Proposal] New Lucene sub-project
> I found your idea very interesting. I would be interested in contributing to the Parse Plugins Framework. I have developed a similar one using Lucene. The project name is Lius.

Hi Rida,

Yes, I know Lius. It seems very interesting, and I think it would be very interesting too if we could merge our efforts into a common lucene sub-project (but for the moment, it seems that the tika project doesn't attract a lot of interest...?).

> If you are interested please let me know.

If nutch-dev is interested in creating such a project, you are welcome.

Regards
Jérôme
Re: PMD integration
>> that right now it is checking only the main code (without plugins?).
> Yes, that's correct -- I forgot to mention that. The PMD target is hooked up with the tests and stops the build if something fails. I thought the core code should be this strict; for plugins we can have more relaxed rules.

-1. Since plugins provide a lot of Nutch's functionality (without any plugin, Nutch provides no service), I think that plugin code should be as strict as the core code.

Thanks
Jérôme
Re: Add .settings to svn:ignore on root Nutch folder?
>> My feeling was simply that the closer we are to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No?
> I absolutely agree Jérôme, really. It's just that developers usually tend to hook up dozens of QA plugins and never look at what they output (that's the usual scenario with Maven-built projects that I have observed).

Yes, that's right... ;-)

> What I think we need is a QA _person_ rather than just tools. But I'm always a bit skeptical, don't take it personally ;)

I absolutely agree Dawid. But I don't think Nutch has enough human resources to have a QA person. I will try to integrate a code coverage tool and see if it gives us a good indication of the unit-testing effort needed.

Jérôme
Re: PMD integration
> I will make it a totally separate target (so tests do not depend on it).

+1

> The goal is to allow other developers to play with pmd easily, but at the same time I do not want the build to be affected.

+1

> I would also like to look at the possibility of generating cross-referenced HTML code from the Nutch sources, as it looks like pmd can use it, and violation reports would be much easier to read.

+1

Thanks Piotr (and Dawid too, of course)
Jérôme
Re: 0.8 release schedule (was Re: latest build throws error - critical)
> Do you guys have any additional insights / suggestions on whether NUTCH-240 and/or NUTCH-61 should be included in this release?

NUTCH-240: I really like the idea, but for now I agree that its API is still ugly. I would like to help in the coming weeks... So for me it should not be included in the 0.8 release.

Regards
Jérôme
Re: Add .settings to svn:ignore on root Nutch folder?
> With code coverage... I don't know. It's up to you guys -- you spend much more time on Nutch code than I do and you know best what is needed and what isn't.

My feeling was simply that the closer we are to Nutch-1.0, the more we need some QA metrics (for us and for nutch users). No?

Jérôme
Re: Add .settings to svn:ignore on root Nutch folder?
> PMD looks like a useful such tool: http://pmd.sourceforge.net/ant-task.html
> I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers?

+1 (Very configurable, very good tool!)
Re: Refactoring some plugins
> I'm reluctant to move the extension interface away from the parameter and return value classes used by that interface.

I'm reluctant too... I asked in case someone had a magic idea...

> Could we instead add a super-interface that all extension-point interfaces extend? That way all of the extension points would be listed in javadoc as implementations of this interface.

+1 ... Committed.

One more question about javadoc (I hope the last one): do you think it makes sense to split the plugins gathered into the Misc group into several plugins (such as index-more / query-more), so that each sub-plugin can be dispatched into the proper group? Another solution could be to use, within these plugins, a different package for each extension they provide. For instance, the language-identifier plugin could be split into the following plugins:
* language-identifier
* parse-lang
* index-lang
* query-lang

Or simply refactored into the following packages:
org.apache.nutch.analysis.lang
org.apache.nutch.parse.lang
org.apache.nutch.indexer.lang
org.apache.nutch.searcher.lang

Jérôme
Re: Refactoring some plugins
> No, I don't think so. These are strongly related bundles of plugins. When you change one, chances are good you'll change the others, so it makes sense to keep their code together rather than split it up. Folks can still find all implementations of an interface in the javadoc, just not always grouped together in the table of contents.

So, we agree.

> We could, instead of calling these misc, call them compound plugins or something. We can change the package.html for each to list the coordinated set of plugins they provide. For example, language-identifier's could say something like, "Includes parse, index and query plugins to identify, index and make searchable the identified language."

I plan to review all the package.html files... I will include those changes.

Thanks!
Jérôme
Re: Refactoring some plugins
> I don't think it's upside down. Plugins should not share packages with core code, since that would permit them to use package-private APIs. Also, re-arranging the code to make the javadoc nice is right, since the javadoc is a primary means of describing the code.

Yes, but what I mean is that it is strange that a documentation issue is what raises this need for refactoring. Moreover, I would like to suggest some other javadoc improvements (?):
1. Create a group for abstract plugins (like lib-http or lib-regex-filter) named, for instance, Plugins API.
2. Create a group for extension points (as far as I remember, one of the first problems when you want to extend nutch is to find where the hooks are, i.e. what the extension points are). Once more, since the javadoc groups are filtered by package, each extension-point interface must be moved to a specific package. The idea is then to move all the core extension points to a new package (for instance org.apache.nutch.api).
3. Create several javadoc plugin groups (one for each major kind of plugin: Indexing, Parsing, Protocol, Query, UrlFilter, and Misc for those that cannot be categorized).

Thanks for your suggestions and comments.
Jérôme
Re: Spelling suggestion for RSS Feed
> I've implemented the spelling correction for the RSS Opensearch feed, hopefully in keeping with the opensearch guidelines. If this format is ok, I'll submit an optional patch alongside the current one at http://issues.apache.org/jira/browse/NUTCH-48.

+1

Jérôme
Re: Much faster RegExp lib needed in nutch?
> Besides that, we should maybe add a kind of timeout to the url filter in general, since it can happen that a user configures a regex for his nutch setup that runs into the same problem we just ran into. Something like the attached below. Would you agree? I can create a serious patch and test it if we are interested in adding this as a fallback to the sources.

+1 as a short-term solution. In the long term, I think we should try to reproduce it and analyze what really happens. (I will commit some minimal unit tests in the next few days.)

Regards
Jérôme
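A possible shape for such a fail-safe, assuming Java 5's java.util.concurrent is available (this is a sketch of the idea, not the attached patch):

    import java.util.concurrent.*;
    import java.util.regex.Pattern;

    public class TimedFilter {
      private static final ExecutorService POOL = Executors.newCachedThreadPool();

      // Returns the match result, or false if the regex takes too long.
      static boolean matches(final Pattern p, final String url, long timeoutMs) {
        Future<Boolean> f = POOL.submit(new Callable<Boolean>() {
          public Boolean call() {
            return Boolean.valueOf(p.matcher(url).matches());
          }
        });
        try {
          return f.get(timeoutMs, TimeUnit.MILLISECONDS).booleanValue();
        } catch (TimeoutException e) {
          f.cancel(true);  // note: the regex thread may keep running anyway
          return false;    // treat a runaway regex as a non-match
        } catch (Exception e) {
          return false;
        }
      }
    }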
Re: Much faster RegExp lib needed in nutch?
> If it were easy to implement all java regex features in dk.brics.automaton.RegExp, then they probably would have. Alternately, if they'd implemented all java regex features, it probably wouldn't be so fast. So I worry that attempts to translate are doomed. Better to accept the differences: if you want the speed, you must use restricted regexes.

That's right. It is a deterministic API: more speed, but less functionality.

> 3. Add new plugins that use dk.brics.automaton.RegExp, using different default regex file names. Then folks can, if they choose, configure things to use these faster regex libraries, but only if they're willing to write the simpler regexes that it supports. If, over time, we find that the most useful regexes are easily converted, then we could switch the default to this.

+1. I will do it this way. Thanks Doug.

Jérôme
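For reference, basic use of the library looks like this (dk.brics.automaton classes; its regex syntax is deliberately restricted, e.g. no backreferences or lookahead):

    import dk.brics.automaton.Automaton;
    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    public class BricsExample {
      public static void main(String[] args) {
        // Compile once: the automaton is deterministic, so matching
        // is linear in the input length regardless of expression size.
        RegExp re = new RegExp("http://([a-z0-9.-]+)/.*\\.html");
        Automaton a = re.toAutomaton();
        RunAutomaton matcher = new RunAutomaton(a);

        System.out.println(matcher.run("http://example.org/index.html")); // true
        System.out.println(matcher.run("ftp://example.org/index.html"));  // false
      }
    }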
Re: Null Pointer exception in AnalyzerFactory?
> I updated to the latest SVN revision (385691) today, and I am now seeing a NullPointerException in the AnalyzerFactory.java class.

Fixed (r385702). Thanks Chris.

> NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-)

It is the right thing to do.

Cheers,
Jérôme
Re: Much faster RegExp lib needed in nutch?
> It's not only faster, it also scales better for large and complex expressions, and it is also possible to build automata from several expressions with AND/OR operators, which is the use case we have in regexp-urlfilter.

It seems awesome! Does somebody plan to switch to this lib in nutch? Is the BSD license compatible with the ASF one?

Jérôme
Re: Much faster RegExp lib needed in nutch?
> Thanks for volunteering, you're welcome ... ;-)

Good job Andrzej! ;-) So, it's now on my todo list to check the perl5 compatibility issue and to provide some benchmarks to the community...

Jérôme
Re: quality of search text
> I think algorithm #1 is what google uses. google ignores content that does not change from page to page, as well as content that isn't part of a block of text.

Are you sure? Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by google and displayed in summaries. But if you can contribute an HtmlParseFilter with the ability to remove menus and navigation, it will be a real improvement. A first step, which I developed in a previous project many years ago, is to drop pages whose textual content appears only inside links: it avoids indexing frames or iframes that contain only navigation text...

Jérôme
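That heuristic can be sketched as a DOM walk comparing anchor text to total text (plain org.w3c.dom; the 95% threshold is an arbitrary illustration, not a tuned value):

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class LinkTextRatio {
      // Total length of text under node; if linksOnly is set, count
      // only text that sits inside an <a> element.
      static int textLength(Node node, boolean linksOnly, boolean inLink) {
        if (node.getNodeType() == Node.TEXT_NODE) {
          return (!linksOnly || inLink) ? node.getNodeValue().trim().length() : 0;
        }
        boolean link = inLink || "a".equalsIgnoreCase(node.getNodeName());
        int sum = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          sum += textLength(c, linksOnly, link);
        }
        return sum;
      }

      // A page whose text is (almost) all anchor text is probably pure navigation.
      static boolean isNavigationOnly(Document doc) {
        int total = textLength(doc, false, false);
        int inLinks = textLength(doc, true, false);
        return total > 0 && inLinks >= 0.95 * total; // arbitrary threshold
      }
    }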
AnalyzerFactory
It seems that the usage of AnalyzerFactory was removed while porting the Indexer to map/reduce (AnalyzerFactory is no longer called anywhere in the trunk code). Is this intentional? (If not, I have a patch that I can commit, so please confirm.)

Jérôme
Re: [jira] Closed: (NUTCH-227) Basic Query Filter no more uses Configuration
In fact, my first need was to be able to configure the boost for RawFieldQueryFilter. The idea is then to give the user better control of boost values by simply:
* adding a setBoost(float) method to RawFieldQueryFilter,
* (adding a setLowerCase(boolean) method to RawFieldQueryFilter),
* adding some configuration properties for the boost values of the current RawFieldQueryFilters: (CC|Type|RelTag|Site|Language)QueryFilter.

Do you think it makes sense to commit such changes? (Or is it just a very specific need that I happen to have?)

Jérôme
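The configurable-boost part could be as small as this (a sketch: the property name is hypothetical, and conf.getFloat is assumed as the accessor; the 0.0f default mirrors a hard-coded value):

    import org.apache.hadoop.conf.Configuration;

    public class BoostConfig {
      private float boost;

      public void setBoost(float boost) {
        this.boost = boost;
      }

      // Read a per-filter boost from configuration, e.g. in the
      // filter's setConf() hook, keeping the old value as default.
      public void configure(Configuration conf) {
        setBoost(conf.getFloat("query.site.boost", 0.0f));
      }
    }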
Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-
> In a distributed configuration one needs to rebuild the job jar each time anything changes, and hence must check all plugins, etc. So I would appreciate it if this didn't take quite so long.

Makes sense! Here is my proposal. For each plugin:
* define a target depending on core (used when building a single plugin);
* define a target not depending on core (used when building the whole code).

I will commit this as soon as possible.

Jérôme
Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java
> Adding DOAP for Nutch. Contributed by Chris Mattmann.
> Added: lucene/nutch/trunk/site/doap.rdf
> Modified:
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbReader.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java
>   lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

It seems that the NUTCH-143 patch has been committed too... is this intentional?

Jérôme
Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote:
> Jérôme Charron wrote:
>> Here is my proposal. For each plugin:
>> * define a target depending on core (used when building a single plugin);
>> * define a target not depending on core (used when building the whole code).
>> I will commit this as soon as possible.
> That sounds perfect. Thanks!

Committed. Quick benchmarks:
* Before: around 70s
* After: around 50s

Better, but not so perfect... :-(

Jérôme
Re: Nutch Parsing PDFs, and general PDF extraction
> This is something google does very well, and something nutch must match to compete.

Richard, it seems you are a real pdf guru, so any code contribution to nutch is welcome. ;-)

Regards
Jérôme
Re: PDF Parse Error
> Yes, but please do not cross-post - many of us are subscribed to both groups, and we're getting multiple copies of your posts...

+1

> I agree, this is inconsistent and should be changed. I think all places should use -1 as a magic value, because it's obviously invalid.

+1. Richard, could you please create a JIRA issue about this?

Thanks
Jérôme
Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-
> Calling compile-core for every plugin makes builds really slow.

I was surprised that nobody had complained about this... ;-)

> I think it's safe to assume that the core has already been compiled before plugins are compiled. Don't you?

It just ensures that the latest modified core version is automatically compiled when compiling a single plugin. From my point of view, the time for a whole build is not a problem: if I just work on core, I can use the fast compile-core target, and if I just work on a plugin, I only compile that plugin. Finally, I use the global compilation very rarely. But perhaps that's not your case, and so it makes sense to reduce the time of the whole build.

Jérôme
Re: Nutch Parsing PDFs, and general PDF extraction
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html
> org.apache.nutch.parse.pdf (Nutch 0.7.1 API) - but I don't see it in the source of the 0.7.1 I downloaded. I see it on cvs here: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/src/java/net/nutch/parse/pdf/

First of all, the nutch source code is no longer hosted on sourceforge, but at apache: http://svn.apache.org/viewcvs.cgi/lucene/nutch/
The class packages have also been changed to org.apache.nutch.

> but my nutch doesn't seem to run the pdf parse class, as my log file shows it fetching pdfs but saying nutch is unable to parse content type application/pdf. Why is this? Was it left out because of performance?

Have you activated the parse-pdf plugin in conf/nutch-default.xml or conf/nutch-site.xml?

Regards
Jérôme
Re: Nutch Parsing PDFs, and general PDF extraction
> Putting in the well-formed version of the plugin code you provided generated the following exception:

Is the nutch-extensionpoints plugin activated?
Re: duplicate libs
> Sounds very good! I may have missed it - are you able to extract the dependencies from the plugin.xml without hacking ant?

Yes, by using the xmlproperty task: it defines a property for each path found in the xml document (http://ant.apache.org/manual/CoreTasks/xmlproperty.html).

Jérôme
Re: duplicate libs
> Yes, there is an easier way. Implement a custom task to which you'll pass a path to plugin.xml and a name for a path. The task (Java code) will create a named (id) path object which can subsequently be used in ant with <classpath refid="xxx"/>. This requires a custom ant task, but as you mentioned, foreach is also a separate library, so I don't see a huge disadvantage. The Carrot2 codebase contains similar functionality in the carrot2-ant-extensions module, although it should be trivial to implement it from scratch.

Thanks Dawid for all this information. I really prefer the way you propose.

Jérôme
Re: duplicate libs
> maybe you will find this interesting also: http://maven.apache.org/using/multiproject.html

Thanks Stefan. Maven seems to be a really good software project management tool, but for now I don't plan to migrate to maven... (I don't have enough knowledge about it, and so I don't have a good overview of it.)

Regards
Jérôme
Re: duplicate libs
> There are a number of duplicated libs in the plugins, namely:

Isn't this already reported in http://issues.apache.org/jira/browse/NUTCH-196? I have already provided a patch there for the log4j lib. If there is no objection, I will commit it and go ahead with:
* lib-commons-httpclient
* lib-nekohtml

Jérôme
Empty Parse
Hi all,

I just noticed an inconsistency when there is a parsing failure:
1. The Fetcher returns an empty ParseImpl instance (it contains no metadata, in particular no SEGMENT_NAME_KEY and SIGNATURE_KEY).
2. The Indexer then tries to add the segment and digest fields to the document from the metadata keys (SEGMENT_NAME_KEY and SIGNATURE_KEY). Unfortunately these values are null, an NPE is thrown, and the process fails.

My question is: what behaviour is expected in such a case?
1. The Fetcher must add the SEGMENT_NAME and SIGNATURE metadata to the empty ParseImpl?
2. The Indexer must ignore documents without SEGMENT_NAME and SIGNATURE?
3. Both?

My feeling is 3, but I would prefer that we discuss this point before committing...

Jérôme
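Option 2 amounts to a null check before indexing; a minimal sketch of the guard (a plain map stands in for the ParseData metadata, and the key strings are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class EmptyParseGuard {
      // Skip documents whose parse metadata is missing the segment
      // name or the signature, instead of letting an NPE kill the job.
      static boolean indexable(Map<String, String> meta) {
        return meta.get("segment") != null && meta.get("digest") != null;
      }

      public static void main(String[] args) {
        Map<String, String> emptyParse = new HashMap<String, String>();
        System.out.println(indexable(emptyParse)); // false -> skipped, no NPE
      }
    }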
Jakarta-POI 3.0-alpha1
Hi,

I have made some experiments with the 3.0-alpha1 version of Jakarta POI (used by parse-msword and parse-mspowerpoint). Since this version contains the hwpf package, it can parse msword documents too (the current version in the lib-jakarta-poi plugin doesn't contain this package). The benefit is that we can remove the poi-2.1 jars bundled with parse-msword and simply add a dependency on the lib-jakarta-poi plugin (as parse-mspowerpoint does): just one version of the POI libs would be bundled in Nutch.

I performed some tests on a lot of zipped doc files (cool to test two plugins at the same time) from the 3GPP site and all is working fine. I did not perform a lot of tests on powerpoints, but the unit tests are ok. If there is no objection, I will commit the changes by the end of the week.

Jérôme
Re: Empty Parse
> Is this happening with the latest code?

Yes. But looking at the svn repository... it is my fault... sorry (NUTCH-139). I will fix that right now.

Thanks
Jérôme
javaswf.jar
Hi,

It seems that the javaswf.jar lib was built using jdk 1.5:

    class file has wrong version 49.0, should be 48.0

Did I miss something, or should Nutch still be compilable with jdk 1.4.x? Please confirm, so that I can commit a new javaswf.jar built with jdk 1.4.

Regards
Jérôme
Re: Cmd line for running plugins
+1

On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
> +1
>
> Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki:
>> Hi,
>> I just found out that it's not possible to invoke main() methods of plugins through the bin/nutch script. Sometimes it's useful for testing and debugging - I can do it from within Eclipse, because I have all plugins on the classpath, but from the command-line it's not possible - in the code they are accessed through PluginRepository. So I added this:
>>
>>     public static void main(String[] args) throws Exception {
>>       NutchConf conf = new NutchConf();
>>       PluginRepository repo = new PluginRepository(conf);
>>       // args[0] - plugin ID
>>       PluginDescriptor d = repo.getPluginDescriptor(args[0]);
>>       if (d == null) {
>>         System.err.println("Plugin '" + args[0] + "' not present or inactive.");
>>         return;
>>       }
>>       ClassLoader cl = d.getClassLoader();
>>       // args[1] - class name
>>       Class clazz = Class.forName(args[1], true, cl);
>>       Method m = clazz.getMethod("main", new Class[]{args.getClass()});
>>       String[] subargs = new String[args.length - 2];
>>       System.arraycopy(args, 2, subargs, 0, subargs.length);
>>       m.invoke(null, new Object[]{subargs});
>>     }
>>
>> It works rather nicely. If other people find it useful, I can add this to PluginRepository.
>> --
>> Best regards,
>> Andrzej Bialecki
>> Information Retrieval, Semantic Web, Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
Re: xml-parser plugin contribution
> Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a new issue and attach the file.

Perhaps you can use this already-existing issue: http://issues.apache.org/jira/browse/NUTCH-23

Jérôme
Re: lang identifier and nutch analyzer in trunk
> Is it reasonable to guess language info from the target server's geographical info?

Yes, it could be another clue for guessing the language. But the problem is then to find out how to combine all these clues. For instance, the current solution is the easiest one, but certainly not the most efficient one: for HTML documents, the HTMLLanguageParser scans the document looking for possible indications of the content language:
1. the html lang attribute,
2. the dc.language meta tag,
3. the http-equiv meta tag.

The first one found is assumed to be the document's language. Only if no language is found this way is the statistical language identifier used.

Jérôme
Re: lang identifier and nutch analyzer in trunk
> Any plan to implement this? I mean, move the LanguageIdentifier class into the nutch core.

As I already suggested on this list, I really would like to move the LanguageIdentifier class (and profiles) to an independent Lucene sub-project (and the MimeType repository too). I don't remember why, but there were some objections to this... Here is a short status of what I have in mind for the next improvements to the LanguageIdentifier / multi-language support:
* Enhance the LanguageIdentifier API by returning something like an ordered LangDetail[] array when guessing the language (each LangDetail containing the language code and its score). I have a prototype of this on my disk, but I haven't taken the time to finalize it.
* I encountered some identification problems with some specific sites (with blogger for instance), and I plan to investigate this point.
* Another pending task: the analysis (and coding) of multilingual querying support.

Regards
Jérôme
Re: lang identifier and nutch analyzer in trunk
> +1. Other local modifications which I use frequently:
> * exporting a list of supported languages,
> * exporting an NGramProfile of the analyzed text,
> * allowing processing of chunks of input (i.e. LanguageIdentifier.identify(char[] buf, int start, int len)) - this is very useful if the text to be analyzed is already present in memory, and the choice of sections (chunks) is made elsewhere, e.g. for documents with clearly outlined sections, or for multi-language documents.

Thanks for these interesting comments Andrzej. I have added them to my todo list.

Jérôme
Re: lang identifier and nutch analyzer in trunk
> I am wondering whether the Analyzer of nutch in svn trunk is chosen by the languageidentifier plugin or not? (I know it was in nutch 0.7.1-dev.)

It's not really chosen by the languageidentifier, but chosen according to the value of the lang attribute (for now, that's right, only the languageidentifier adds this attribute).

> In org.apache.nutch.indexer.Indexer, line 104:
>
>     writer.addDocument((Document)((ObjectWritable)value).get());
>
> It should be:
>
>     NutchAnalyzer analyzer = AnalyzerFactory.get(doc.get("lang"));
>     writer.addDocument((Document)((ObjectWritable)value).get(), analyzer);
>
> right?

Yes, it should. Thanks for noticing this. A merge problem? (I don't remember adding this in nutch-0.7...)

> Once more, shouldn't query parsing call AnalyzerFactory too? The query input is multi-lingual also.

The query part is not yet implemented.

Jérôme
Re: HTMLMetaProcessor a bug?
> the following code would fail in case the meta tags are in upper case:
>
>     Node nameNode = attrs.getNamedItem("name");
>     Node equivNode = attrs.getNamedItem("http-equiv");
>     Node contentNode = attrs.getNamedItem("content");

This code works well, because the Nutch HTML Parser uses the Xerces HTMLDocumentImpl implementation, which lowercases attribute names (while element names are uppercased). For consistency, and to decouple the Nutch HTML Parser a little from the Xerces implementation, I suggest changing these lines to something like:

    Node nameNode = null;
    Node equivNode = null;
    Node contentNode = null;
    for (int i = 0; i < attrs.getLength(); i++) {
      Node attr = attrs.item(i);
      String attrName = attr.getNodeName().toLowerCase();
      if (attrName.equals("name")) {
        nameNode = attr;
      } else if (attrName.equals("http-equiv")) {
        equivNode = attr;
      } else if (attrName.equals("content")) {
        contentNode = attr;
      }
    }

Jérôme
Re: ParserFactory test fail
Hi Stefan,

No, in fact I have refactored the code of the protocol-http plugins, not the html parser. So I don't think the log4j error comes from this code.

Regards
Jérôme
Re: test suite fails?
I have the same problem too. I don't understand what happens. In fact, the CommandRunner returns a -1 exit code, but there is nothing on the error output and the expected string on the standard output (nutch rocks nutch rocks nutch rocks). All seems to be ok except the exit code.

Jérôme

On 1/9/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
> It fails on my machine on the parse-ext tests. I am not sure what is causing it yet, and I am afraid I do not have time to investigate it today - maybe in a few days. I did make a small change to get it to compile a few days ago, but all tests passed before I committed it.
> Regards
> Piotr
>
> Stefan Groschupf wrote:
>> Hi,
>> is anyone able to run the test suite without any problems?
>> Stefan
Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/
... in fact, not really... really unrelated!!! I will remove it immediately. Thanks

On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> --- lucene/nutch/trunk/src/plugin/build.xml (original)
>> +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
>> @@ -6,13 +6,14 @@
>>    <!-- Build & deploy all the plugin jars. -->
>>    <!-- ====== -->
>>    <target name="deploy">
>> -    <!--<ant dir="analysis-de" target="deploy"/>-->
>> -    <!--<ant dir="analysis-fr" target="deploy"/>-->
>> +    <ant dir="analysis-de" target="deploy"/>
>> +    <ant dir="analysis-fr" target="deploy"/>
>
> Was this change intentional? It looks unrelated. Otherwise, this looks great!
> Doug
Re: problems http-client
A related issue is that these two plugins replicate a lot of code. At some point we should try to fix that. See: http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Is anybody else on it? Can I go on? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: no static NutchConf
Excuse me in advance, I probably missed something, but what are the use cases for having many NutchConf instances with different values? Running many different tasks in parallel, each using a different config, inside the same JVM. Ok, I understand that, Andrzej, but it is not really what I would call a use case; it is more a feature that you describe here. What I mean is that I don't understand in which cases it will be useful. And I don't understand how a particular NutchConf will be selected for a particular task... Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Static initializers
Andrzej, how do you choose the NutchConf to use? Here is a short discussion I had with Doug about a kind of dynamic NutchConf inside the same JVM: ... By looking at the mailing list archives, it seems that having some behavior depend on the document's URL is a recurrent problem (for instance boosting documents matching a URL pattern - the NUTCH-16 issue - and many other topics). So, our idea is to provide a dynamic nutch configuration (that overrides the default one, like nutch-site does), based on documents matching URL patterns. The idea is as follows: 1. The default configuration is, as usual, the nutch-default.xml file. 2. An XML file can map URL regexps to other sets of configuration properties (that override nutch-default):

<nutch:conf>
  <url regexp="http://www.mydomain1.com/*">
    <!-- A set of nutch properties that override the nutch-default for this domain -->
    <property>
      <name>property1</name>
      <value>value1</value>
    </property>
  </url>
</nutch:conf>

What do you think about this? Looking deeper, this is messier than I thought... Some changes would be required to the plugin instantiation mechanisms, e.g.:

Extension.getExtensionInstance() -> getExtensionInstance(NutchConf)
ExtensionPoint.getExtensions() -> getExtensions(NutchConf)
PluginRepository.getExtensionPoint(String) -> getExtensionPoint(String, NutchConf)

etc., etc. The way this would work would be similar to the mechanism described above: if plugin instances are not created yet, they would be created once (based on the current NutchConf argument), and then cached in this NutchConf instance. The plugin implementations would also have to extend NutchConfigured, taking a NutchConf as the argument to their constructors - because now Extension.getExtensionInstance would pass the current NutchConf instance to their constructors. That's exactly what I had in mind while speaking about a dynamic NutchConf with Doug. For me it's a +1. The only thing I don't really like is extending NutchConfigured, but it is the most secure way to implement it. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
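To make the constructor change concrete, here is a minimal sketch of a plugin class under this scheme (MyUrlFilter, its accept() method, and the property name are made-up examples; NutchConfigured and NutchConf are assumed to live in org.apache.nutch.util as in the trunk):

import org.apache.nutch.util.NutchConf;
import org.apache.nutch.util.NutchConfigured;

// Made-up example plugin implementation: it extends NutchConfigured and
// receives the current NutchConf through its constructor, as
// Extension.getExtensionInstance(NutchConf) would pass it.
public class MyUrlFilter extends NutchConfigured {

  public MyUrlFilter(NutchConf conf) {
    super(conf); // the created instance is then cached in this NutchConf
  }

  public boolean accept(String url) {
    // Behavior now depends on the per-task configuration, not on statics;
    // "myfilter.url.prefix" is an illustrative property name.
    return url.startsWith(getConf().get("myfilter.url.prefix", "http://"));
  }
}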
Re: Latest version of Mapred
Thanks for the fast response. Do you know where I can find a compressed version? Here are the nightly builds: http://cvs.apache.org/dist/lucene/nutch/nightly/ Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: vote results.
Just continue voting, I will continue with my tally sheet. :-) Why not create a wiki page... so that you don't have to do this tedious work? Jérôme
Re: [Fwd: Crawler submits forms?]
What do people think about collecting a list of issues and running a voting iteration? +1
Hard-coded Content-type checks
Hi, I would like to remove all the hard-coded content-type checks spread over all the parse plugins. In fact, the content-type/plugin-id mapping is now centralized in the parse-plugins.xml file, and there is no longer any need for a parser to check the content-type. The basic idea was: 1. The developer has the responsibility to declare in the plugin.xml of his parser the content-type(s) it handles. 2. Then, the administrator has the ability to use a parser for any content-type he wants. 3. The ParserFactory WARNs the administrator if a parser is mapped to a content-type that it was not initially designed to handle (per the plugin.xml file). So there is no more need for hard-coded content-type checks; it is the administrator's responsibility to take care of the content-type/plugin-id mappings. For instance, in my use case, I have mapped the application/xhtml+xml content-type to the parse-html parser. But with the current hard-coded content-type check in parse-html, the parse-html plugin cannot handle application/xhtml+xml content. If there is no objection, I will commit these changes in the next few hours. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
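For illustration, the mapping mentioned above would look something like this in parse-plugins.xml (a sketch based on the format in the trunk):

<parse-plugins>
  <!-- let the HTML parser also handle XHTML content -->
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>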
Re: Standard metadata property names in the ParseData metadata
+1 A simple solution that provides a standard way to access common metadata. Great! -- http://motrech.free.fr/ http://www.frutch.org/
Re: [Fwd: Crawler submits forms?]
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to the 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88):
http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
http://svn.apache.org/viewcvs.cgi?rev=355809&view=rev
http://svn.apache.org/viewcvs.cgi?rev=354398&view=rev
http://svn.apache.org/viewcvs.cgi?rev=326889&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321250&view=rev
http://svn.apache.org/viewcvs.cgi?rev=321231&view=rev
http://svn.apache.org/viewcvs.cgi?rev=306808&view=rev
http://svn.apache.org/viewcvs.cgi?rev=293370&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292865&view=rev
http://svn.apache.org/viewcvs.cgi?rev=292035&view=rev
[EMAIL PROTECTED] Piotr, what about the Italian translation? 0.7.2 could be a good candidate for a commit, no? This has been fixed in the mapred branch, but that patch is not in 0.7.1. This alone might be a reason to make a 0.7.2 release. http://svn.apache.org/viewcvs.cgi?view=rev&rev=348533 I would be happy to see some more parser selection problems fixed, but it looks like Jerome is also working hard to get stuff fixed; maybe we can wait for that. I think we can wait for the enhancement proposed by Chris today: adding an alias in the parse-plugins.xml file and using a content-type/extension-id mapping instead of content-type/plugin-id. For further improvements, the new mime-type repository based on the freedesktop mime-types will be needed. I cannot reasonably include this in 0.7.2, but I think it will be in the trunk by the end of the year. What reasonable target date can we plan for a 0.7.2? Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)
The total number of hits (approx) is 2,780,000,000. BTW, I find it curious that the last 3 or 6 digits always seem to be zeros ... there's some clever guesstimation involved here. The fact that Google Suggest is able to return results so quickly would support this suspicion. For more information about fake Google counts, I suggest you take a look at some tests performed by Jean Véronis, a French academic: http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Urlfilter Patch
Suggestion: for consistency, and for ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and at the activated plugins, we know which content-types will be parsed. So, by getting the file extensions associated with each content-type, we can build the list of file extensions to include in the fetch process (all other ones being excluded); see the sketch below. No? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
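A rough sketch of that list construction (all names here are illustrative, not actual Nutch APIs; the real lookups would go through parse-plugins.xml, the plugin registry, and the mime-type repository):

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ParseableExtensions {

  /**
   * Builds the set of file extensions to include in the fetch.
   * parsedTypes holds the content-types covered by an activated parse
   * plugin; mimeToExtensions maps a content-type to its known file
   * extensions, as the mime-type repository would provide them.
   */
  public static Set<String> includedExtensions(
      Set<String> parsedTypes, Map<String, List<String>> mimeToExtensions) {
    Set<String> included = new HashSet<String>();
    for (String type : parsedTypes) {
      List<String> exts = mimeToExtensions.get(type); // e.g. "text/html" -> [html, htm]
      if (exts != null) {
        included.addAll(exts);
      }
    }
    return included;
  }
}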
Re: [Nutch-dev] incremental crawling
Sounds really good (and it is requested by a lot of nutch users!). +1 Jérôme On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote: Matt Kangas wrote: #2 should be a pluggable/hookable parameter. high-scoring sounds like a reasonable default basis for choosing recrawl intervals, but I'm sure that nearly everyone will think of a way to improve upon that for their particular system. e.g. high-scoring ain't gonna cut it for my needs. (0.5 wink ;) In NUTCH-61, Andrzej has a pluggable FetchSchedule. That looks like a good idea. http://issues.apache.org/jira/browse/NUTCH-61 Doug -- http://motrech.free.fr/ http://www.frutch.org/
Re: Urlfilter Patch
Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type; the only thing we can use for filtering is the document's URL. So, another alternative could be to exclude only the file extensions that are registered in the mime-type repository (some well-known file extensions) but for which no parser is activated, and to accept all the other ones, so that the .foo files will still be fetched... Jérôme
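A sketch of that alternative (the class and method names are made up; in Nutch this logic would live in a URLFilter plugin, which returns null to reject a URL):

import java.util.Set;

// Made-up example filter: reject URLs whose extension is known to the
// mime-type repository but handled by no activated parser; accept
// everything else, so unknown extensions like ".foo" still get fetched.
public class UnparseableExtensionFilter {

  private final Set<String> knownExtensions;      // from the mime-type repository
  private final Set<String> parseableExtensions;  // extensions with an activated parser

  public UnparseableExtensionFilter(Set<String> known, Set<String> parseable) {
    this.knownExtensions = known;
    this.parseableExtensions = parseable;
  }

  /** Returns the URL if accepted, or null to filter it out. */
  public String filter(String url) {
    int dot = url.lastIndexOf('.');
    int slash = url.lastIndexOf('/');
    if (dot <= slash) {
      return url; // no file extension: accept
    }
    String ext = url.substring(dot + 1).toLowerCase();
    if (knownExtensions.contains(ext) && !parseableExtensions.contains(ext)) {
      return null; // well-known extension, but no activated parser
    }
    return url;
  }
}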
Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser
Do we talk about parsing RDF, or do we discuss storing parsed HTML text in RDF and converting it via XSLT to pure text? I may have misunderstood something. I really like the idea of a general RDF parser. Back in the day I played around with jena.sf.net. Parsing, yes; replacing the Nutch sequence file and the concept of Writables with XML is, from my point of view, a bad idea. Once more: please read the proposal again, and my responses. The proposal doesn't suggest replacing the way data is stored in Nutch. It is just a proposal for a generic XML parser (as the title suggests) :-) I'm the last one to inhibit innovation, and I would love to see Nutch able to parse billions of pages. But today, parsing billions of pages is not the only challenge for search engines (look at Google, which no longer displays the number of indexed pages). The parsing of many content types and the language technologies (language-specific stemming, analysis, querying, summarization, ...) are some of the other new challenges... The low-level challenges are important, but they must not be a brake on the high-level processes. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Lucene or Nutch
I would be disappointed by this move - the language identifier is an important component of Nutch, and the mere fact that it's bundled with Nutch encourages its proper maintenance. If there is enough drive in terms of willingness and long-term commitment, it would make sense to move it to a separate project of its own (or maybe into Jakarta Commons), but moving it into a catch-all, purely optional category like Lucene contrib would increase the risk that it slides into oblivion... Ok, Andrzej, I really understand your point. But more and more people are contacting me directly in order to use the language-identifier not as a nutch plugin, but simply as a standalone library. They get confused when I explain to them that they need the nutch jar in order to use the language-identifier. That's why I would like to make it a standalone jar. A short-term solution could be to move the core classes (which have no dependencies on nutch) to a new lib plugin (lib-lang for instance), adding a dependency on this plugin to the language-identifier, so that this code could be used as a standalone lib. Are you ok with such changes? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: Lucene or Nutch
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik, I have been thinking about this for a while, but didn't take the time to act on it. This mail is a good opportunity... In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or into the Lucene code base. Does it make sense? What do you think about it? What is the best solution (standalone vs. Lucene)? Doug? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: standard version of log4j
hmmm.. so that means if we want to customize logging it would potentially have to be done for every plugin? Perhaps a common logger would at least make some degree of sense. I really think it makes sense. When I fixed the issue about plugin dependencies, I began to create a log4j plugin in order to remove all the log4j versions imported by many plugins (what you suggest). But it is not so simple. In fact, parse-rss and parse-pdf use some log4j imports in their code just to redirect the log4j output to Java's native logger (they don't really customize it). Otherwise, the imports of log4j are only used by some other jars imported by the plugins (not a direct dependency). If the jars these plugins depend on use only common log4j features, it seems there is no problem with removing the log4j jars from each plugin and adding a dependency on a new lib-log4j plugin. But the only ways to check for regressions are:
* Look in the source code of PDFBox and the other jars imported by plugins that use log4j, and check that they are able to use any log4j-1.2.x version.
* Create a lib-log4j plugin, remove all the log4j jars, add a dependency on the lib-log4j plugin to all the plugins that previously imported log4j.jar, and then perform a runtime test of these plugins and cross fingers.
But yes, I really think it makes sense. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect
There is one potential problem that I see -- Nutch plugins require explicit JAR references. If you want to switch between algorithms you'll need to either put all the Carrot2 JARs in the descriptor, put them in the CLASSPATH before Nutch starts, or do some other trickery with class loading. Only available in the trunk: you can also now define inter-plugin dependencies using plugin identifiers instead of explicit jar references. These dependencies are then checked for availability and added to the classloader at runtime. Take a look at the analysis-fr and analysis-de plugins, which depend on lib-lucene-analyzers. You can also notice that now, for instance, all plugins depend on the nutch-extensionpoints plugin. Incidentally, I recently noticed that many plugins import a log4j.jar. It would be a good idea to define a lib-log4j plugin and add a dependency on this plugin for each plugin that imports log4j.jar in its lib (of course, we must take care of the log4j version used). Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
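For reference, such an inter-plugin dependency is declared in a plugin's plugin.xml roughly like this (modeled on the analysis-fr descriptor; a lib-log4j plugin would be imported the same way):

<requires>
  <import plugin="nutch-extensionpoints"/>
  <import plugin="lib-lucene-analyzers"/>
</requires>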
Re: plugin analyzer
I think it would be neat to have the NutchAnalyzer also as a plugin; from my understanding, right now if I want to analyze in a different way, I need to hack the nutch source code. If we had different plugins for different analyzers, that would help. Some specific applications may use a Porter analyzer, others Snowball for Italian, ...; with the plugin approach these will coexist nicely. The same goes for providing summaries: for instance, if we enable clustering, the way a search result is summarized helps produce meaningful clusters. Let me know if you find this an attractive feature ;-), I can find some free time and do the coding. Yes, it is definitely an attractive feature! I have recently committed to the trunk support for multi-lingual analyzer plugins. There is an Analyzer extension point, so you can develop your own analysis plugins. For now, the AnalyzerFactory chooses a plugin depending on the result of the language identifier. I have committed two analysis plugins, one for French and one for German. They are just wrappers around the Lucene French and German analyzers. By default, these plugins are not deployed, since: 1. they are at an early testing stage; 2. these analyzers make sense only if some query analyzers are provided too (not yet done). You can take a look at the proposal I made earlier (not finished, since I have been working on other issues for now): http://wiki.apache.org/nutch/MultiLingualSupport Cheers Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
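As an illustration, such a wrapper is essentially the following sketch (modeled on the analysis-fr plugin; the exact class layout in the trunk may differ slightly):

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.nutch.analysis.NutchAnalyzer;

// Exposes Lucene's FrenchAnalyzer through the NutchAnalyzer extension point.
public class FrenchNutchAnalyzer extends NutchAnalyzer {

  private final FrenchAnalyzer analyzer = new FrenchAnalyzer();

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Delegate tokenization (elision, stop words, stemming) to Lucene.
    return analyzer.tokenStream(fieldName, reader);
  }
}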
Re: plugin analyzer
I read about the MultiLingualSupport, but I didn't see it in the repository; I think it is cool. The analyzer extension point is defined by the NutchAnalyzer abstract class: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java The default analyzer is this one: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java The choice of the analyzer to use is made by the AnalyzerFactory: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/AnalyzerFactory.java The German analyzer is located at: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-de/ and the French one at: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/plugin/analysis-fr/ Yes, I actually hacked the source code to provide stemming: I changed the analyzer, added a new query-stemm plugin, and changed the summarizer (as the terms were not highlighted after using the stemmer). Sounds good! Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/