Hi Charlie,
Don't cross-post to two lists at once. The question you asked is
relevant to C2, not Nutch, so I'll reply to it there.
Dawid
charlie wrote:
Dear all,
Currently I’m using the Nutch plug-in “clustering-carrot2” and would
like to ask for some help. When I built the search
Hi,
I made an experiment with Google, to see if they use a similar approach.
I find the results to be most interesting. I selected a query which is
guaranteed to give large result sets, but is more complicated than a
single term query: http com.
The total number of hits (approx) is 2,780,000,000. BTW, I find it
curious that the last 3 or 6 digits always seem to be zeros ... there's
some clever guesstimation involved here. The fact that Google Suggest is
able to return results so quickly would support this suspicion.
For more
Ken,
maybe the user mailing list would be a better place for such questions.
The size of your index depends on your configuration (what kind of
index filter plugins you use).
As a rough estimate, a document in the index needs 10 KB plus the metadata,
such as the date, content type or category of the page.
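
A back-of-the-envelope sketch of that estimate, in Java since Nutch is written in Java. Only the 10 KB per document figure comes from the message above; the class name, the document count and the per-document metadata size are made-up inputs for illustration, so treat the result as an order of magnitude at best.

```java
// Rough index-size estimate. The 10 KB figure is taken from the message;
// everything else here is a hypothetical input, not a measured value.
final class IndexSizeEstimate {
    static final long BYTES_PER_DOC = 10L * 1024; // "10KB" per indexed document

    // Returns the estimated index size in bytes for the given number of
    // documents, each carrying metadataBytesPerDoc bytes of extra metadata.
    static long estimateBytes(long documents, long metadataBytesPerDoc) {
        return documents * (BYTES_PER_DOC + metadataBytesPerDoc);
    }

    public static void main(String[] args) {
        // Hypothetical crawl: one million pages, about 512 bytes of metadata each.
        long bytes = estimateBytes(1_000_000L, 512L);
        System.out.println(bytes + " bytes");
    }
}
```

The actual size still depends on which index filter plugins are enabled, so the metadata term is the part to adjust.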
Thanks Stefan. I'll resend this to the user list as well. Just thought
the dev list might be better since we're using the map/reduce version.
Thanks!
Stefan Groschupf wrote:
Ken,
maybe the user mailing list would be a better place for such questions.
The size of your index depends on you
Hi
I am going to standardize some fields which I stored in my parser
plugin. But I found that sometimes
parse.getData().getMetadata().get(propertyName) is NULL. In fact
when I stepped through the source code, the value of propertyName is not
NULL.
So can someone explain this? Thanks
/Jack
--
Keep
Jack,
discussed here in detail:
http://issues.apache.org/jira/browse/NUTCH-133
I will provide a patch just fixing this issue very soon.
Stefan
On 09.12.2005 at 20:04, Jack Tang wrote:
Hi
I am going to standardize some fields which I stored in my parser
plugin. But I found that sometimes
HTTP header metadata is case-insensitive in the real world (e.g. Content-Type
or content-type)
Key: NUTCH-135
URL: http://issues.apache.org/jira/browse/NUTCH-135
Project:
[ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]
Stefan Groschupf updated NUTCH-135:
---
Attachment: contentProperties_patch.txt
As Doug suggested, a patch using TreeMap with String.CASE_INSENSITIVE_ORDER that
solves the problem of case insensitive
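
For readers unfamiliar with the approach, here is a minimal sketch of a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. It is not the attached patch and not Nutch's ContentProperties class; the class and method names are hypothetical, and it only shows why a lookup then succeeds regardless of how the header name is capitalized.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the code from contentProperties_patch.txt.
final class CaseInsensitiveHeaders {
    // Builds a map whose String keys are compared without regard to case.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-Type", "text/html");
        System.out.println(headers.get("content-type")); // prints text/html
        System.out.println(headers.get("CONTENT-TYPE")); // prints text/html
    }
}
```

One consequence of this design is that putting the same header under a different capitalization overwrites the earlier value instead of adding a second entry.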
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12359961 ]
Andrzej Bialecki commented on NUTCH-135:
-
Since you are already working on this issue, I'd like to ask you to take a look
at NUTCH-3, and see if you can solve this
[ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ]
Stefan Groschupf commented on NUTCH-135:
Andrzej, that is easy to add to the ContentProperties object, and sure, I can do
that. However, first I would love to get an OK
[ http://issues.apache.org/jira/browse/NUTCH-3?page=all ]
Stefan Groschupf reassigned NUTCH-3:
Assign To: Stefan Groschupf
multi values of header discarded
Key: NUTCH-3
URL:
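
To illustrate what the issue title refers to, here is one way repeated header values could be kept rather than discarded. This is only a sketch under assumptions: MultiValueHeaders is a hypothetical class, not Nutch's ContentProperties and not the eventual fix for NUTCH-3. It combines case-insensitive keys with a list of values per header name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Nutch.
final class MultiValueHeaders {
    // Header names compare case-insensitively; each name keeps every value added.
    private final Map<String, List<String>> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Appends a value instead of replacing any value already stored for the name.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns all values for the name in insertion order, or an empty list.
    List<String> getAll(String name) {
        List<String> found = values.get(name);
        return found == null
                ? Collections.emptyList()
                : Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        MultiValueHeaders headers = new MultiValueHeaders();
        headers.add("Set-Cookie", "a=1");
        headers.add("set-cookie", "b=2");
        System.out.println(headers.getAll("SET-COOKIE")); // prints [a=1, b=2]
    }
}
```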