RE: Is there some arbitrary limit on content stored for use by summaries?

2010-04-23 Thread Tim Redding
Yeah I've played with that.  My current value for file.content.limit is
10.  That's considerably longer than the page I'm having
problems with.

It's fast approaching the time when I'll have to split the page into lots
of smaller pages. :-(  Thankfully we own the site that we use Nutch on, as
that appears to be the only solution to this summary issue.


Tim.. 


Re: Is there some arbitrary limit on content stored for use by summaries?

2010-04-22 Thread Julien Nioche
Try refetching with a different value for:

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>


Julien
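
A local override of this kind normally goes in conf/nutch-site.xml rather
than in nutch-default.xml. The fragment below is only a sketch under two
assumptions not stated in the thread: that the page is fetched over HTTP
(in which case the analogous http.content.limit property is the one that
applies, since file.content.limit covers file: URLs), and that 1048576 is
an acceptable ceiling. Both values are illustrative, not recommendations.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>

  <!-- Applies to content fetched via the file: protocol. -->
  <property>
    <name>file.content.limit</name>
    <!-- Illustrative value (1 MiB); a negative value disables truncation. -->
    <value>1048576</value>
  </property>

  <!-- Assumption: the page is served over HTTP, so this is the limit
       that would actually apply to it. -->
  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>

</configuration>
```

Truncation happens when the content is downloaded, so the page has to be
refetched before a changed limit can have any effect.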


RE: Is there some arbitrary limit on content stored for use by summaries?

2010-04-22 Thread Tim Redding
Hey Arkadi,

I've tried upping indexer.max.tokens to Integer.MAX_VALUE but it still
doesn't show a relevant summary. :-(

Any other ideas?



Tim..







__
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email 
__

Tribal DDB, a division of DDB UK Limited, Company No. 00933578, with its 
registered office situated at 12 Bishops Bridge Road, London W2 6AA.
__
This e-mail is intended only for the named person or entity to which it is 
addressed and contains valuable business information that is privileged, 
confidential and/or otherwise protected from disclosure. Dissemination, 
distribution or copying of this e-mail or the information herein by anyone 
other than the intended recipient, or an employee, or agent responsible for 
delivering the message to the intended recipient, is strictly prohibited. All 
contents are the copyright property of the sender. If you are not the intended 
recipient, you are nevertheless bound to respect the sender's worldwide legal 
rights. We require that unintended recipients delete the e-mail and destroy all 
electronic copies in their system, retaining no copies in any media. 
__
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email


RE: Is there some arbitrary limit on content stored for use by summaries?

2010-04-21 Thread Arkadi.Kosmynin
Hi Tim,

I would think that this parameter is related to the problem you describe, but 
the default value should allow indexing pages of the size you mention. Did you 
change this parameter?

Regards,

Arkadi


<property>
  <name>indexer.max.tokens</name>
  <value>10000</value>
  <description>
  The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  -1, then the only limit is your memory, but you should anticipate
  an OutOfMemoryError.
  </description>
</property>
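
To try a larger token limit, the property can be overridden in
conf/nutch-site.xml. This is a minimal sketch: the value 100000 is an
arbitrary illustration, not a figure taken from this thread.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>

  <property>
    <name>indexer.max.tokens</name>
    <!-- Illustrative value only; tokens beyond this count in a field
         are left out of the index. Larger values need more memory. -->
    <value>100000</value>
  </property>

</configuration>
```

The limit is applied at indexing time, so the segment has to be re-indexed
before the change shows up in search results.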
  




Is there some arbitrary limit on content stored for use by summaries?

2010-04-21 Thread Tim Redding
Hey,
 
We have a long page that appears in the search results but the summary
never contains the search terms.  Why is this?
 
If we move the text containing the search terms up the page they get
displayed in the summary, so it's obviously related to some limit imposed
somewhere.  I've looked through all the configuration options and none
appear to change anything that sounds related to this.
 
We use Nutch 1.0 and the page in question is 8.7KB in size.
 
 
Any help please?
 
 
Tim..
 
 
 
 
 
 
Tim Redding
Senior Java Developer
Tribal DDB
12 Bishop's Bridge Road
London W2 6AA
T: +44 (0)20 7258 4517  I  F: +44 (0)20 7258 4253
 
