RE: Is there some arbitrary limit on content stored for use by summaries?
Yeah, I've played with that. My current value for file.content.limit is 10. That's considerably longer than the page I'm having problems with.

It's fast approaching that time where I have to split the page into lots of smaller pages. :-( Thankfully we own the site that we use Nutch on, as that appears to be the only solution to this summary issue.

Tim..
Re: Is there some arbitrary limit on content stored for use by summaries?
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: 22 April 2010 21:56
To: nutch-user@lucene.apache.org

Try refetching with a different value for:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>

Julien
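The truncation rule in that description can be sketched in a few lines of Java. This is only an illustration of the quoted semantics, not Nutch's own code: the class name `ContentLimitSketch` and the method `applyLimit` are hypothetical, and the 8,900-byte array merely stands in for the 8.7KB page mentioned in the thread. Nutch also has an analogous `http.content.limit` property for content fetched over HTTP, which may be the one that matters if the page is not read through `file:` URLs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the rule in the property description;
// this is not taken from the Nutch source.
final class ContentLimitSketch {

    /** A limit >= 0 truncates longer content to that many bytes; a negative limit keeps everything. */
    public static byte[] applyLimit(byte[] content, int limit) {
        if (limit >= 0 && content.length > limit) {
            return Arrays.copyOf(content, limit);
        }
        return content;
    }

    public static void main(String[] args) {
        byte[] page = "intro text ... search term near the end".getBytes(StandardCharsets.UTF_8);

        // With a small limit only the first bytes survive: prints "intro text".
        System.out.println(new String(applyLimit(page, 10), StandardCharsets.UTF_8));

        // A negative limit disables truncation, so the whole string is printed.
        System.out.println(new String(applyLimit(page, -1), StandardCharsets.UTF_8));

        // A page of roughly 8.7KB is well under 65536 bytes, so it would pass through untouched.
        System.out.println(applyLimit(new byte[8_900], 65536).length);
    }
}
```

Under this rule a limit of 65536 bytes would not cut into a page of about 8.7KB, which fits Tim's report that changing the value made no difference.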
RE: Is there some arbitrary limit on content stored for use by summaries?
From: Tim Redding
Sent: 22 April 2010 18:44

Hey Arkadi,

I've tried upping the value to Integer.MAX_VALUE but it still doesn't show a relevant summary. :-(

Any other ideas?

Tim..
RE: Is there some arbitrary limit on content stored for use by summaries?
From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
Sent: 21 April 2010 23:29
To: nutch-user@lucene.apache.org

Hi Tim,

I would think that this parameter is related to the problem you describe, but the default value should allow indexing pages of the size you mention. Did you change this parameter?

Regards,

Arkadi

<property>
  <name>indexer.max.tokens</name>
  <value>1</value>
  <description>
  The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  -1, then the only limit is your memory, but you should anticipate
  an OutOfMemoryError.
  </description>
</property>
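To make the effect of a per-field token cap concrete, here is a minimal Java sketch. It assumes plain whitespace tokenization, which is a simplification and not how Nutch or Lucene analyze text; the class `MaxTokensSketch` and the method `indexedTokens` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a token cap; whitespace splitting is an
// assumption made for brevity, not the analyzer Nutch actually uses.
final class MaxTokensSketch {

    /** Returns at most maxTokens whitespace-separated tokens; a negative cap means no cap. */
    public static List<String> indexedTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (maxTokens >= 0 && kept.size() >= maxTokens) {
                break; // everything after the cap is never indexed
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        String page = "header navigation intro body body body searchterm footer";

        // With a cap of 5 the term near the end is dropped from the index.
        System.out.println(indexedTokens(page, 5).contains("searchterm"));

        // With no cap (-1) the term is kept.
        System.out.println(indexedTokens(page, -1).contains("searchterm"));
    }
}
```

Note that a term cut off this way would not match at search time at all, whereas the page in this thread is found and only its summary misses the terms, so the sketch shows the mechanism the description refers to rather than a confirmed cause.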
Is there some arbitrary limit on content stored for use by summaries?
From: Tim Redding [mailto:tim.redd...@tribalddb.co.uk]
Sent: Thursday, 22 April 2010 2:18 AM
To: nutch-user@lucene.apache.org

Hey,

We have a long page that appears in the search results but the summary never contains the search terms. Why is this?

If we move the text containing the search terms up the page they get displayed in the summary, so it's obviously related to some limit imposed somewhere. I've looked through all the configuration options and none appear to change anything that sounds related to this.

We use Nutch 1.0 and the page in question is 8.7KB in size.

Any help please?

Tim..

Tim Redding
Senior Java Developer
Tribal DDB
12 Bishop's Bridge Road
London W2 6AA
T: +44 (0)20 7258 4517 I F: +44 (0)20 7258 4253

Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with its registered office situated at 12 Bishops Bridge Road, London W2 6AA.
__
This e-mail is intended only for the named person or entity to which it is addressed and contains valuable business information that is privileged, confidential and/or otherwise protected from disclosure. Dissemination, distribution or copying of this e-mail or the information herein by anyone other than the intended recipient, or an employee, or agent responsible for delivering the message to the intended recipient, is strictly prohibited. All contents are the copyright property of the sender. If you are not the intended recipient, you are nevertheless bound to respect the sender's worldwide legal rights. We require that unintended recipients delete the e-mail and destroy all electronic copies in their system, retaining no copies in any media.
__
This email has been scanned by the MessageLabs Email Security System. For more information please visit http://www.messagelabs.com/email
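Since the symptom depends on how far down the page the search terms sit, one way to narrow it down is to measure that position and compare it with whatever limits are configured. The Java sketch below is a hypothetical diagnostic helper, not part of Nutch: `TermPositionSketch`, `firstTokenIndex` and `firstByteOffset` are illustrative names, and whitespace tokenization is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper for locating a search term in extracted page text.
final class TermPositionSketch {

    /** 0-based index of the first whitespace token equal to term (ignoring case), or -1 if absent. */
    public static int firstTokenIndex(String text, String term) {
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equalsIgnoreCase(term)) {
                return i;
            }
        }
        return -1;
    }

    /** UTF-8 byte offset of the first case-sensitive occurrence of term, or -1 if absent. */
    public static int firstByteOffset(String text, String term) {
        int charIndex = text.indexOf(term);
        if (charIndex < 0) {
            return -1;
        }
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String pageText = "one two three four five";
        System.out.println(firstTokenIndex(pageText, "four"));  // token position of the term
        System.out.println(firstByteOffset(pageText, "four"));  // byte position of the term
    }
}
```

Running something like this over the text of the problem page, before and after moving the terms up, would show roughly which token or byte position separates a working summary from a missing one.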