-1
Maybe it would be a better idea to go for the 0.7 branch and schedule a new
0.7.1 release in a short time?
But +1 to include it in a 0.7.1 release!
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
expected:<el> but was:<sv>
junit.framework.ComparisonFailure: expected:<el> but was:<sv>
As I suspect, it is a result of the latest updates to the LanguageIdentifier
plugin or its tests. Since I am not deep into it, I will not try to debug it
myself at the moment - I just wanted you to know about the issue.
I am using JDK 1.5 on
Windows - I can test it on 1.4 and 1.5 on linux tomorrow - maybe this is
the problem.
OK. Thanks
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
It works on my Linux box - with both JDK 1.4 and 1.5.
ok. so it seems to be consistent with my conf.
I will try to track it down.
I assume it is an encoding problem of the Ngram profile files, but I have no
time this evening.
Regards
Jérôme
It looks like you have committed your changes to the tags directory. You
should do it in branches. I think there is no way in SVN to enforce
immutability of tags :(.
Oops, sorry.
I will commit my changes to the branches directory right now.
Thanks Piotr.
Regards
Jerome
--
http://motrech.free.fr/
I see several instances of 'analySer' in comments/javadoc and some
variables. That should probably be changed to the American version -
analyzer, for consistency's sake.
Yes, that's right.
Thanks.
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I did a little digging and it appears that lang ends
up being null (couldn't quite track down where lang
should have been set). Not sure if it is a proper
fix, but changing doc.getField("lang").stringValue()
to doc.get("lang") makes my little crawl complete.
lang is null because you don't have
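For anyone hitting the same NPE: Lucene's Document.get(name) returns null when the field is absent, while getField(name) also returns null, so the chained stringValue() dereferences a null Field. The same null-propagation difference, sketched with a plain Map stand-in (NullField, safe, and unsafe are hypothetical names for illustration, not Lucene's API):

```java
import java.util.HashMap;
import java.util.Map;

/** Null-propagation sketch: why doc.get("lang") survives a missing field
 *  while doc.getField("lang").stringValue() throws. Plain Map stand-in. */
public class NullField {
    public static String unsafe(Map<String, Object> doc) {
        return doc.get("lang").toString();   // NPE when "lang" is absent
    }
    public static String safe(Map<String, Object> doc) {
        Object v = doc.get("lang");          // just null when absent
        return v == null ? null : v.toString();
    }
    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();   // no "lang" field set
        System.out.println(safe(doc));               // null, the crawl continues
        try {
            unsafe(doc);
        } catch (NullPointerException e) {
            System.out.println("NPE, as in the reported failure");
        }
    }
}
```

The doc.get(lang) change works because the caller can test for null instead of crashing.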
I agree it is important to have the NGramProfile.getSimilarity() method.
However, I think it is also important that it is consistent with the
scoring
that LanguageIdentifier uses, even if LanguageIdentifier optimises the
implementation. Looking at the code I see that the two scoring
Tom,
I have created the NUTCH-86 issue to report the needed changes in the
LanguageIdentifier we discussed in this thread.
The issue is available at http://issues.apache.org/jira/browse/NUTCH-86
Regards
Jérôme
I see several instances of 'analySer' in comments/javadoc and some
variables. That should probably be changed to the American version -
analyzer, for consistency's sake.
Corrected/Committed
(http://svn.apache.org/viewcvs.cgi?rev=265020&view=rev)
Regards
Jérôme
--
http://motrech.free.fr/
Glancing at other Apache projects in subversion, I see that httpd uses
branch names like 2.2.x and tag names like 2.2.4. That's a little
cryptic. I propose that we use branch names like branch-2.4 and tag
names like release-2.4.1. What do folks think?
+1
Jérôme
--
http://motrech.free.fr/
I am a bit lost but just a quick check - shouldn't it also be committed
in Release-0.7 branch?
No, the analyzer extension-point is committed only in trunk.
It's a new feature, so I follow Committer's Rules (
http://wiki.apache.org/nutch/Committer's_Rules)
;-)
Regards
Jérôme
--
Michael,
the solution is perhaps to use Jakarta Commons DateUtils.parseDate method:
http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[])
It will give something like:
Date parsedDate =
it works great (see the new function below). But we'll have to add
commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries.
Are there any objections? How is the procedure to add it?
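If pulling in commons-lang is a concern: DateUtils.parseDate essentially just tries each pattern in turn until one parses the whole string. A stdlib-only sketch of that behavior (MultiFormatDate is a hypothetical name, not the commons-lang implementation):

```java
import java.text.ParseException;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class MultiFormatDate {
    /** Try each pattern in turn, as commons-lang DateUtils.parseDate does. */
    public static Date parse(String value, String[] patterns) throws ParseException {
        for (String pattern : patterns) {
            SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
            f.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date d = f.parse(value, pos);
            // accept only if the whole input was consumed by this pattern
            if (d != null && pos.getIndex() == value.length()) {
                return d;
            }
        }
        throw new ParseException("Unparseable date: " + value, -1);
    }

    public static void main(String[] args) throws ParseException {
        String[] patterns = { "EEE, dd MMM yyyy HH:mm:ss zzz", "yyyy-MM-dd" };
        System.out.println(parse("2006-01-09", patterns));
    }
}
```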
There's already commons-logging, in nutch libs, so I think there's no
problem to add commons-lang.
Moreover it is under the Apache License, so there's no problem.
I will add it while committing your patch.
No objections for adding commons-lang to the nutch lib.
As it is a generic lib, I plan
to get regex-normalize.xml to work I must put:
in nutch-site.xml
In nutch-default.xml there is set:
Is this a bug or a feature? =)
nutch-site.xml overrides properties defined in nutch-default.xml. So:
* If you remove the urlnormalizer.class property from nutch-default it will
still use the one
I think I expressed it wrong. The question was whether it is a feature or a
bug that regex-normalize.xml is used only after these changes.
the regex-normalize.xml is used only after you specify that you want to use
the RegexUrlNormalizer implementation. So it is used only if you specify
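The nutch-default/nutch-site layering discussed above can be pictured with java.util.Properties defaults - an analogy only, not NutchConf's actual implementation (class names are illustrative):

```java
import java.util.Properties;

public class ConfLayering {
    /** Returns the value seen by code after both config layers are loaded. */
    public static String effective() {
        Properties defaults = new Properties();           // like nutch-default.xml
        defaults.setProperty("urlnormalizer.class",
                "org.apache.nutch.net.BasicUrlNormalizer");

        Properties site = new Properties(defaults);       // like nutch-site.xml
        site.setProperty("urlnormalizer.class",
                "org.apache.nutch.net.RegexUrlNormalizer"); // override wins

        return site.getProperty("urlnormalizer.class");
    }

    public static void main(String[] args) {
        System.out.println(effective());
    }
}
```

A property set only in the defaults still shows through; a property set in both layers is taken from the site layer.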
Hi,
I have just committed some modifications that make it possible to declare
dependencies between plugins.
I would like to apply this mechanism to the parse-ms* related plugins, which
both use Jakarta POI code.
The idea is: instead of duplicating the Jakarta POI jars in the lib
directory of each parse-ms* plugin
This mechanism already works, since a plugin uses the jar URLs from all of
its dependent plugins in its own class-loader.
Ok. So, after a long private mail exchange with Stefan (thanks for your time
and
The change doesn't reflect in the screen after I
re-compile the Nutch code and re-launch the tomcat.
Do you re-deploy the web app?
--
http://motrech.free.fr/
http://www.frutch.org/
Since the plugins can specify dependencies on each other, this raises an
administration problem.
For a Nutch administrator, it is not user-friendly to specify which plugins
to activate/deactivate.
With plugin inter-dependencies, the administrator needs to know that a
plugin depends on another one
Maybe you should discuss such things before you 'committed' a new
feature that already exists.
I normally read most of the nutch mails. What was the date and subject?
I may have missed that one.
I don't know; it's Stefan's sentence, not mine, so please ask Stefan.
Regards
Jérôme
--
But other than that, your analysis is correct; probably there should be
an application/xml entry added to the list of handled content types. But
this is further complicated by the fact that Nutch doesn't do the right
thing now if you have more than one plugin handling the same mime type...
I have
I may have some time to work on this over the next few days if no one else
does. So, if you're taking the lead on this, I volunteer my help if you'd
like it.
Hi Chris,
Thanks for your help. It seems that nobody has started working on this yet
(I planned to do it in the next few weeks).
First of
Jerome: Give me a shout if you need a hand on this. I'll be happy to
help and as it happens, I'll be available in the next few weeks.
Sébastien,
Great! As I mentioned in my last comment on JIRA, please synchronize with
Chris on this point.
I'm currently coding on other subjects and don't have
What about a default plugin, as Andrzej proposed?
The default plugin mechanism is integrated in the parse-plugins descriptor
using the * content-type
It should behave like
the unix-command strings. Does this make sense? Are you on it too?
But we didn't plan to develop it
Otherwise, I
I noticed that the HTML-Parser plugin has references to xercesImpl.jar,
which is placed in
src/plugin/parse-rss/lib/xercesImpl.jar
Where did you find references to xercesImpl.jar in the HTML-Parser plugin?
(If so, I don't understand how it can compile, since the build scripts never
import any lib
Likely missing file:/. If I get rid of lines 617-622
of conf/nutch-default.xml
Oops, sorry.
I made this last change just after testing the whole patch,
and I didn't test it again since I was sure it was a minor change.
I'll correct it right now. Sorry.
Regards
Jérôme
--
I think it would be neat to have the NutchAnalyzer also as a plugin. From my
understanding, right now if I want to analyze in a different way, I need to
hack the nutch source code; if we are going to have different plugins for
different analyzers, that will help. Some specific application may use
I read about the MultiLingualSupport, but I didn't see it in the
repository. I think it is cool.
The analyzer extension point is defined by the NutchAnalyzer abstract class:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java
The default analyzer
There is one potential problem that I see -- Nutch plugins require
explicit JAR references. If you want to switch between algorithms you'll
need to either put all Carrot2 JARs in the descriptor, put them in
CLASSPATH before Nutch starts or do some other trickery with class
loading.
Only
hmmm.. so that means if we want to customize logging
it would be for every plugin potentially?
Perhaps a common logger would at least make some degree
of sense.
I really think it makes sense.
When I fixed the issue about plugin dependencies, I began to create a log4j
plugin
in order to remove
Yes, Lucene is the best fit for what you're after. Nutch is built on
Lucene, and adds web crawling on top. You don't need a web crawler,
so using Lucene directly is the best fit - of course you'll have to
write code to integrate Lucene.
Erik,
I was thinking about it for a while, but don't
I would be disappointed by this move - language identifier is an
important component in Nutch. Now the mere fact that it's bundled with
Nutch encourages its proper maintenance. If there is enough drive in
terms of willingness and long-term commitment it would make sense to
move it to a
Do we talk about parsing rdf or do we discuss to store parsed html
text in rdf and convert it via xslt to pure text?
I may have misunderstood something. I really like the idea of a general rdf
parser. Back in the days I played around with jena.sf.net
Parsing yes, replace nutch sequence file and the
Suggestion:
For consistency purposes, and ease of nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file
Sounds really good (and it is requested by a lot of nutch users!).
+1
Jérôme
On 12/1/05, Doug Cutting [EMAIL PROTECTED] wrote:
Matt Kangas wrote:
#2 should be a pluggable/hookable parameter. high-scoring sounds like
a reasonable default basis for choosing recrawl intervals, but I'm sure
Right, but the URL filters run long before we know the mime type, in
order to try to keep us from fetching lots of stuff we can't process.
The mime type is not known until we've fetched it.
Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the
The total number of hits (approx) is 2,780,000,000. BTW, I find it
curious that the last 3 or 6 digits always seem to be zeros ... there's
some clever guesstimation involved here. The fact that Google Suggest is
able to return results so quickly would support this suspicion.
For more
Hi,
I would like to remove all the hard-coded content-type checks spread over
all the parse plugins.
In fact, the content-type/plugin-id mapping is now centralized in the
parse-plugins.xml file, and there's no
more need for the parsers to check the content-type.
The basic idea was:
1. The
+1
A simple solution that provides a standard way to access common meta data.
Great!
--
http://motrech.free.fr/
http://www.frutch.org/
+1 for a 0.7.2 release.
Here are the issues/revisions I can merge to 0.7 branch.
These changes mainly concern the parser-factory changes (NUTCH-88)
http://issues.apache.org/jira/browse/NUTCH-112
http://issues.apache.org/jira/browse/NUTCH-135
http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev
What do people think about collecting a list of issues and running a voting
iteration?
+1
Just continue voting; I will continue with my tally sheet. :-)
Why not create a wiki page... so that you don't have to do this tedious
work?
Jérôme
Thanks for the fast response,
Do you know where I can find a compressed version?
Here are the nightly builds:
http://cvs.apache.org/dist/lucene/nutch/nightly/
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Andrzej,
How do you choose the NutchConf to use ?
Here is a short discussion I had with Doug about a kind of dynamic NutchConf
inside the same JVM:
... By looking at the mailing list archives, it seems that having some
behavior depend on the document's URL is a recurrent problem (for instance
Excuse me in advance, I probably missed something, but what are the use
cases for having many NutchConf instances with different values?
Running many different tasks in parallel, each using different config,
inside the same JVM.
Ok, I understand this Andrzej, but it is not really what I call
A related issue is that these two plugins replicate a lot of code. At
some point we should try to fix that. See:
http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
I have begun working on this. Nobody else? Can I go on?
Jérôme
--
http://motrech.free.fr/
I have the same problem too.
I don't understand what happens.
In fact, the CommandRunner returns a -1 exit code, but there is nothing in
the error output and the expected string is in the standard output (nutch
rocks nutch rocks nutch rocks).
All seems to be ok but the exit code.
Jérôme
On 1/9/06, Piotr
... in fact, not really... really unrelated!!!
I'll remove it immediately.
Thanks
On 1/9/06, Doug Cutting [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
--- lucene/nutch/trunk/src/plugin/build.xml (original)
+++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006
@@ -6,13
the following code would fail in case the meta tags are in upper case:
Node nameNode = attrs.getNamedItem("name");
Node equivNode = attrs.getNamedItem("http-equiv");
Node contentNode = attrs.getNamedItem("content");
This code works well, because Nutch HTML Parser uses Xerces
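The case issue is easy to reproduce: DOM attribute lookup via getNamedItem is case-sensitive, so upper-case META attributes are only found if the HTML parser normalizes names before the DOM is built. A small stdlib-only demonstration (MetaCase and the markup are made up for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

public class MetaCase {
    /** Parses the markup and reports whether attrName is found via getNamedItem. */
    public static boolean found(String markup, String attrName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
        NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
        return attrs.getNamedItem(attrName) != null;
    }

    public static void main(String[] args) throws Exception {
        String upper = "<META HTTP-EQUIV='Content-Type' CONTENT='text/html'/>";
        System.out.println(found(upper, "http-equiv")); // false: lookup is case-sensitive
        System.out.println(found(upper, "HTTP-EQUIV")); // true
    }
}
```

With an HTML parser that lower-cases attribute names during DOM construction, the lower-case lookup would succeed, which matches the "it works because of the parser" explanation above.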
Hi Stefan,
No, in fact I have refactored the code of the protocol-http plugins, not the
html parser.
So I don't think the log4j error comes from this code.
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I am wondering whether the Analyzer in nutch svn trunk is chosen by the
languageidentifier plugin or not? (I know that in nutch 0.7.1-dev it was.)
It's not really chosen by the languageidentifier, but chosen based on the
value of the lang attribute (for now, that's right, only the
languageidentifier adds this
Any plan to implement this? I mean move the LanguageIdentifier class
into the nutch core.
As I already suggested on this list, I really would like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why but
+1. Other local modifications which I use frequently:
* exporting a list of supported languages,
* exporting an NGramProfile of the analyzed text,
* allow processing of chunks of input (i.e.
LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
very useful if the text to
Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a
new issue and attach the file.
Perhaps you can use this already existing issue
http://issues.apache.org/jira/browse/NUTCH-23
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Is it reasonable to guess language info from the target server's
geographical info?
Yes, it could be another clue for guessing the language.
But the problem is then to find how to combine all these clues.
For instance, the actual solution is the easiest one, but certainly not the
most efficient one:
For
+1
On 2/1/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
+1
Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki:
Hi,
I just found out that it's not possible to invoke main() methods of
plugins through the bin/nutch script. Sometimes it's useful for
testing and debugging - I can do it
Hi,
It seems that the javaswf.jar lib was built using JDK 1.5:
class file has wrong version 49.0, should be 48.0
Did I miss something, or should Nutch still be compiled using JDK 1.4.x?
Please confirm, so that I can commit a new javaswf.jar built with JDK 1.4
Regards
Jérôme
--
Hi all,
I just notice an inconsistency when there is a parsing failure :
1. The Fetcher return an empty ParseImpl instance (it contains no metadata,
especially SEGMENT_NAME_KEY and SIGNATURE_KEY)
2. Then, the Indexer tries to add the fields segment and digest from the
metadata keys
Hi,
I have made some experiments with the 3.0-alpha1 version of Jakarta POI
(used by parse-msword and parse-mspowerpoint).
Since this version contains the hwpf package, it makes it possible to parse
msword documents too (the current version in the lib-jakarta-poi plugin
doesn't contain this package).
The benefit
Is this happening with the latest code?
Yes.
But by looking in the svn repository, it is my fault... sorry
(NUTCH-139)
I'll fix that right now.
Thanks
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
There are a number of duplicated libs in the plugins, namely:
Isn't it already reported in http://issues.apache.org/jira/browse/NUTCH-196?
I have already provided a patch for the log4j lib.
If there is no objection, I will commit it and go ahead with
* lib-commons-httpclient
* lib-nekohtml
Jérôme
Yes, there is an easier way. Implement a custom task to which you'll
pass a path to plugin.xml and a name for a path. The task (Java code)
will create a named (id) path object which can be subsequently used in
ant with <classpath refid="xxx" />.
This requires a custom ant task, but as you
maybe you will find this interesting also:
http://maven.apache.org/using/multiproject.html
Thanks Stefan.
Maven seems to be a really good software project management tool.
But for now, I don't plan to migrate to maven...
(I don't have enough knowledge about it and so I don't have a good overview
Sounds very good! I may have missed it - are you able to extract the
dependencies from the plugin.xml without hacking ant?
Yes, by using the xmlproperty task: it defines a property for each path
found in the xml document
( http://ant.apache.org/manual/CoreTasks/xmlproperty.html )
Jérôme
--
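For reference, a minimal sketch of that xmlproperty route in a plugin build file. The property name shown assumes the usual `<plugin><requires><import plugin="..."/></requires>` nesting of Nutch plugin descriptors and is illustrative only:

```xml
<!-- Flatten plugin.xml into ant properties; with collapseAttributes="true"
     an attribute like <import plugin="X"/> becomes plugin.requires.import.plugin -->
<xmlproperty file="plugin.xml" collapseAttributes="true"/>
<echo message="depends on: ${plugin.requires.import.plugin}"/>
```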
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/pdf/package-summary.html
org.apache.nutch.parse.pdf (Nutch 0.7.1 API)
but I don't see it in the source of 0.7.1 downloaded
I see it on cvs here:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/parse-pdf/s
Putting in the well-formed version of the plugin code you provided generated
the following exception:
Is the nutch-extensionpoints plugin activated?
This is something google does very well, and something nutch must match
to compete.
Richard, it seems you are a real pdf guru, so any code contribution to nutch
is welcome.
;-)
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Yes, but please do not cross-post - many of us are subscribed to both
groups, and we're getting multiple copies of your posts...
+1
I agree, this is inconsistent and should be changed. I think all places
should use -1 as a magic value, because it's obviously invalid.
+1
Richard, could you
Calling compile-core for every plugin makes builds really slow.
I was surprised that nobody complained about this... ;-)
I
think it's safe to assume that the core has already been compiled before
plugins are compiled. Don't you?
It just ensures that the last modified core version is
In a distributed configuration one needs to rebuild the job jar each
time anything changes, and hence must check all plugins, etc. So I
would appreciate it if this didn't take quite so long.
Makes sense!
Here is my proposal. For each plugin:
* Define a target containing core (will be used when
Adding DOAP for Nutch. Contributed by Chris Mattmann.
Added:
lucene/nutch/trunk/site/doap.rdf
Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
On 3/3/06, Doug Cutting [EMAIL PROTECTED] wrote:
Jérôme Charron wrote:
Here is my proposal. For each plugin:
* Define a target containing core (will be used when building single
plugin)
* Define a target not containing core (will be used when building whole
code)
I commit this as soon
In fact, my first need was to be able to configure the boost for
RawFieldQueryFilter.
The idea is then to give to the user a better control of boost values by
simply :
* add a setBoost(float) method to RawFieldQueryFilter.
* (add a setLowerCase(boolean) method to RawFieldQueryFilter)
* Add some
I think algorithm #1 is what google uses.
google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.
Are you sure?
Take a look at this search results:
It seems that the usage of AnalyzerFactory was removed while porting the
Indexer to map/reduce.
(AnalyzerFactory is no longer called in trunk code.)
Is it intentional?
(if no, I have a patch that I can commit, so thanks to confirm)
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
It's not only faster, it also scales better for large and complex
expressions, it is also possible to build automata from several
expressions with AND/OR operators, which is the use case we have in
regexp-urlfilter.
It seems awesome!
Does somebody plan to switch to this lib in nutch?
Does
Thanks for volunteering, you're welcome ... ;-)
Good job Andrzej! ;-)
So, it's now on my todo list to check the perl5 compatibility issue and to
provide some benchmarks to the community...
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I updated to the latest SVN revision (385691) today, and I am now seeing a
NullPointerException in the AnalyzerFactory.java class.
Fixed (r385702). Thanks Chris.
NOTE: not sure if returning null is the right thing to do here, but hey,
at least it made my crawl finish! :-)
It is the
Besides that, maybe we should add a kind of timeout to the url filters in
general,
since it can happen that a user configures a regex for his nutch setup
that runs into the same problem we just ran into.
Something like the below attached.
Would you agree? I can create a serious patch and test it
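One way to sketch such a timeout: run the match in a Future and make the CharSequence abort on interruption, since java.util.regex never checks for interruption on its own. This is a commonly used trick, not the attached patch; all names here are illustrative:

```java
import java.util.concurrent.*;
import java.util.regex.*;

public class TimedRegex {
    /** CharSequence wrapper that aborts matching once the thread is interrupted. */
    static class InterruptibleCharSequence implements CharSequence {
        final CharSequence inner;
        InterruptibleCharSequence(CharSequence inner) { this.inner = inner; }
        public char charAt(int i) {
            if (Thread.currentThread().isInterrupted()) {
                throw new RuntimeException("regex match interrupted");
            }
            return inner.charAt(i);
        }
        public int length() { return inner.length(); }
        public CharSequence subSequence(int s, int e) {
            return new InterruptibleCharSequence(inner.subSequence(s, e));
        }
        public String toString() { return inner.toString(); }
    }

    /** Returns the match result, or null if matching exceeded timeoutMs. */
    public static Boolean matches(Pattern p, String input, long timeoutMs) {
        ExecutorService es = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = es.submit(() ->
                    p.matcher(new InterruptibleCharSequence(input)).matches());
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null;                 // filter took too long
        } catch (Exception e) {
            return null;                 // interrupted or failed mid-match
        } finally {
            es.shutdownNow();            // interrupts the matcher thread
        }
    }

    public static void main(String[] args) {
        Pattern benign = Pattern.compile("https?://.*");
        Pattern evil = Pattern.compile("(a+)+b"); // catastrophic backtracking
        String as = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // 40 a's, no 'b'
        System.out.println(matches(benign, "http://example.org/", 2000));
        System.out.println(matches(evil, as, 300));
    }
}
```

A url filter could treat the null return as "reject and log", so one pathological pattern cannot stall the whole fetch.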
If it were easy to implement all java regex features in
dk.brics.automaton.RegExp, then they probably would have. Alternately,
if they'd implemented all java regex features, it probably wouldn't be
so fast. So I worry that attempts to translate are doomed. Better to
accept the differences:
I've implemented the spelling correction for the RSS Opensearch feed,
hopefully in keeping with the opensearch guidelines.
If this format is ok, I'll submit an optional patch alongside the
current one at http://issues.apache.org/jira/browse/NUTCH-48 .
+1
Jérôme
--
http://motrech.free.fr/
I don't think it's upside down. Plugins should not share packages with
core code, since that would permit them to use package-private APIs.
Also, re-arranging the code to make the javadoc nice is right, since the
javadoc is a primary means of describing the code.
Yes, but what I mean is that
I'm reluctant to move the extension interface away from the parameter
and return value classes used by that interface.
I'm reluctant too... I asked, in case someone has a magic idea...
Could we instead add a
super-interface that all extension-point interfaces extend? That way
all of the
No, I don't think so. These are strongly related bundles of plugins.
When you change one chances are good you'll change the others, so it
makes sense to keep their code together rather than split it up. Folks
can still find all implementations of an interface in the javadoc, just
not always
PMD looks like a useful such tool:
http://pmd.sourceforge.net/ant-task.html
I would not be opposed to integrating PMD or something similar into
Nutch's build.xml. What do others think? Any volunteers?
+1 (Very configurable, very good tool!)
With code coverage... I don't know. It's
up to you guys -- you spend much more time on Nutch code than I do and
you know best what is needed and what isn't.
My feeling was simply that the closer we are to Nutch-1.0, the more we need
some QA metrics (for us and for nutch users). No?
Jérôme
that right now it is checking only main code (without plugins?).
Yes, that's correct -- I forgot to mention that. PMD target is hooked up
with tests and stops the build if something fails. I thought the core
code should be this strict; for plugins we can have more relaxed rules
-1
Since
My feeling was simply that the closer we are to Nutch-1.0, the more we
need
some QA metrics (for us and for nutch users). No?
I absolutely agree Jérôme, really. It's just that developers usually
tend to hook up dozens of QA plugins and never look at what they output
(that's the usual
I will make it a totally separate target (so tests do not
depend on it).
+1
The goal is to allow other developers to play with pmd easily but at the
same time I do not want the build to be affected.
+1
I would also like to look at the possibility of generating cross-referenced
HTML code from
Do you guys have any additional insights / suggestions whether NUTCH-240
and/or NUTCH-61 should be included in this release?
NUTCH-240: I really like the idea, but for now I agree that the API is
still ugly. I would like to help in the next weeks...
So for me it should not be included in
It seems there is an inconsistency with content-type handling in Nutch:
1. The protocol-level content-type header is added to the content's
metadata.
2. The content-type is then checked/guessed while instantiating the Content
object and stored in a private field
(at this step, the Content object can
I found your idea very interesting. I would be interested to contribute to
the Parse Plugins Framework. I have developed a similar one using Lucene.
The project name is Lius.
Hi Rida,
Yes, I know Lius.
It seems very interesting, and I think it would be very interesting too
if we can merge our
Piotr, please keep oro-2.0.8 in pmd-ext
I do not agree here - we are going to make a new release next week and
releasing with two versions of oro does not look nice. oro is quite a
stable product and the changes are in fact minimal:
http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES
OK for
I would like to come back to this issue:
The Content object holds two content-types:
1. The raw content-type from the protocol layer (the http header in the
case of http) in the Content's metadata.
2. The guessed content-type in a private content-type field.
When a ParseData object is created, it takes
Hi all,
Just for fun, I have created a public nutch calendar on Google Calendar.
You can add it to your Google calendars or access it via these URLs:
Feed URL is :
http://www.google.com/calendar/feeds/[EMAIL PROTECTED]/public/basic
ICAL URL is :
http://www.google.com/calendar/ical/[EMAIL
Is this just needed for references from javadoc? If so, then this can
be copied to build/docs, no?
Yes. Committed.
Jérôme
parse-oo plugin manifest is valid with plugin.dtd
Oops, I didn't catch that... Thanks!
No problem Andrzej.
It is just a cosmetic change, since the plugin.xml files are not validated
at runtime (it is on my todo list),
and the contentType and pathSuffix parameters are more or less deprecated.
Jérôme
Are you mainly concerned with charset in Content-Type?
Not specifically.
But while looking at these content-type inconsistencies, I noticed that
there is some possible
trouble with the charset in the content-type.
Currently, what happens when Content-Type exists in both the HTTP layer and
in a META tag