Re: [C2-devel] about the question of clustering-carrot2

2005-12-09 Thread Dawid Weiss
Hi Charlie, Don't cross-post to two lists at once. The question you asked is relevant to C2, not Nutch, so I'll reply to it there. Dawid charlie wrote: Dear all, Currently I’m using the Nutch plug-in “clustering-carrot2” and would like to ask for some help. When I built the search

Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Andrzej Bialecki
Hi, I made an experiment with Google, to see if they use a similar approach. I find the results to be most interesting. I selected a query which is guaranteed to give large result sets, but is more complicated than a single term query: http com. The total number of hits (approx) is

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
The total number of hits (approx) is 2,780,000,000. BTW, I find it curious that the last 3 or 6 digits always seem to be zeros ... there's some clever guesstimation involved here. The fact that Google Suggest is able to return results so quickly would support this suspicion. For more

Re: nutch questions

2005-12-09 Thread Stefan Groschupf
Ken, maybe the user mailing list would be a better place for such questions. The size of your index depends on your configuration (what kind of index filter plugins you use). As a rough guide, a document in the index needs 10 KB plus the metadata such as date, content type or category of the page.

Re: nutch questions

2005-12-09 Thread Ken van Mulder
Thanks Stefan. I'll resend this to the user list as well. Just thought the dev list might be better since we're using the map/reduce version. Thanks! Stefan Groschupf wrote: Ken, maybe the user mailing list would be a better place for such questions. The size of your index depends on you

parse.getData().getMetadata().get(propName) is NULL?

2005-12-09 Thread Jack Tang
Hi, I am going to standardize some fields which I stored in my parser plugin. But I found that sometimes parse.getData().getMetadata().get(propertyName) is NULL. In fact, when I stepped through the source code, the value of propertyName is not NULL. So can someone explain this? Thanks /Jack -- Keep

Re: parse.getData().getMetadata().get(propName) is NULL?

2005-12-09 Thread Stefan Groschupf
Jack, discussed here in detail: http://issues.apache.org/jira/browse/NUTCH-133 I will provide a patch just fixing this issue very soon. Stefan On 09.12.2005 at 20:04, Jack Tang wrote: Hi I am going to standardize some fields which I stored in my parser plugin. But I found that sometimes

[jira] Created: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
http header meta data are case insensitive in the real world (e.g. Content-Type or content-type) Key: NUTCH-135 URL: http://issues.apache.org/jira/browse/NUTCH-135 Project:

[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ] Stefan Groschupf updated NUTCH-135: --- Attachment: contentProperties_patch.txt As Doug suggested, a patch using a TreeMap with String.CASE_INSENSITIVE_ORDER that solves the problem of case insensitive
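
For readers unfamiliar with the approach mentioned in this update, the following is a minimal sketch of a case-insensitive header map built on java.util.TreeMap and String.CASE_INSENSITIVE_ORDER. It is not the attached contentProperties_patch.txt and not Nutch's ContentProperties class; the class name CaseInsensitiveHeaders and the helper newHeaderMap are hypothetical names chosen only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not reproduce the Nutch patch.
public class CaseInsensitiveHeaders {

    // Returns a map whose keys are compared without regard to case,
    // so "Content-Type" and "content-type" address the same entry.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-Type", "text/html");

        // Lookup succeeds with a differently cased key.
        System.out.println(headers.get("content-type"));

        // A put with another casing replaces the value instead of adding an entry.
        headers.put("CONTENT-TYPE", "text/plain");
        System.out.println(headers.size() + " entry: " + headers.get("Content-Type"));
    }
}
```

One consequence of this design is that the map keeps the key spelling from the first put, and later puts under a different casing only overwrite the value. Lookups cost O(log n) rather than the O(1) of a HashMap, which should rarely matter for the handful of headers on a typical HTTP response.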

[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12359961 ] Andrzej Bialecki commented on NUTCH-135: - Since you already are working on this issue, I'd like to ask you to take a look at NUTCH-3, and see if you can solve this

[jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ] Stefan Groschupf commented on NUTCH-135: Andrzej, that is easy to add to the ContentProperties object, and sure, I can do that. However, first I would love to get an OK

[jira] Assigned: (NUTCH-3) multi values of header discarded

2005-12-09 Thread Stefan Groschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ] Stefan Groschupf reassigned NUTCH-3: Assign To: Stefan Groschupf multi values of header discarded Key: NUTCH-3 URL: