Re: Problems for indexing large documents on SolrCloud
Hi,

First, thanks for your advice. I ran several more tests and finally managed to index all the data on my SolrCloud cluster. The error was on the client side; it is documented in this post: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201406.mbox/%3ccfc09ae1.94f8%25rebecca.t...@ucsf.edu%3E

EofException from Jetty means one specific thing: the client software disconnected before Solr was finished with the request and sent its response. Chances are good that this is because of a configured socket timeout on your SolrJ client or its HttpClient. This might have been done with the setSoTimeout method on the server object.

So I increased the Solarium timeout from 5 to 60 seconds and all the data is now indexed correctly. The error was not reproducible on my development PC because the database and Solr were on the same local virtual machine with plenty of available resources, so indexing there was faster than in the SolrCloud cluster.

Thanks,
Olivier

2014-09-11 0:21 GMT+02:00 Shawn Heisey s...@elyograg.org:

On 9/10/2014 2:05 PM, Erick Erickson wrote:
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

The maximum HTTP request size is actually controlled by Solr itself since 4.1, with changes committed for SOLR-4265. Changing the setting on Tomcat probably will not help. An example from my own config which sets this to 32MB (the default is 2048, i.e. 2MB):

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="32768" formdataUploadLimitInKB="32768"/>

Thanks,
Shawn
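[Editor's note: for reference, a minimal SolrJ 4.x sketch of the timeout settings Shawn describes (setSoTimeout on the server object). The URL, core name, and the specific millisecond values are illustrative assumptions, not taken from the thread.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class TimeoutExample {
        public static void main(String[] args) {
            // Placeholder URL; point this at your own collection.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Socket (read) timeout: how long the client waits for Solr's response.
            // A large bulk update on a busy cluster can easily exceed a 5-second default.
            server.setSoTimeout(60000);         // 60 seconds
            // Connection timeout: how long to wait for the TCP connection itself.
            server.setConnectionTimeout(15000); // 15 seconds

            // ... send update requests as usual ...
            server.shutdown();
        }
    }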
Problems for indexing large documents on SolrCloud
Hi,

I have some problems indexing large documents in a SolrCloud cluster of 3 servers (Solr 4.8.1) with 3 shards and 2 replicas per shard, on Tomcat 7. One specific document (with 300K values in a multivalued field) could not be indexed on SolrCloud, but I could index it on a single Solr instance on my own PC. Indexing is done with Solarium from a database. The indexed data are e-commerce products with classic fields like name, price, description, instock, etc. The large field (type int) consists of the ids of other products. The only difference from the other documents that index fine is the size of that multivalued field: those documents all have between 100K and 200K values for that field. The index size is 11 MB for 20 documents.

To solve it, I tried to change several parameters, including the ZooKeeper timeouts in solr.xml.

In the solrcloud section:
    <int name="zkClientTimeout">6</int>
    <int name="distribUpdateConnTimeout">10</int>
    <int name="distribUpdateSoTimeout">10</int>

In the shardHandlerFactory section:
    <int name="socketTimeout">${socketTimeout:10}</int>
    <int name="connTimeout">${connTimeout:10}</int>

I also tried to increase these values in solrconfig.xml:
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="1" formdataUploadLimitInKB="10" addHttpRequestToContext="false"/>

I also tried to increase the amount of RAM (these are VMs): each server has 4 GB of RAM, with 3 GB for the JVM. Are there any other settings that could solve the problem and that I may have forgotten?

The error messages are:
ERROR SolrDispatchFilter null:java.lang.RuntimeException: [was class java.net.SocketException] Connection reset
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected EOF in attribute value
SolrCore org.apache.solr.common.SolrException: Unexpected end of input block in start tag

Thanks,
Olivier
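[Editor's note: the thread's indexing goes through Solarium (PHP); as a rough sketch of the same scenario in SolrJ, the snippet below builds a document with a very large multivalued field. The core name and field names are invented for illustration.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class LargeMultiValuedDoc {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/products");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "product-42");
            doc.addField("name", "Example product");

            // A very large multivalued int field (ids of related products).
            // 300K values make a multi-megabyte update request, which is where
            // request-size limits and client/server timeouts start to matter.
            for (int i = 0; i < 300000; i++) {
                doc.addField("related_ids", i);
            }

            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }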
Re: Problems for indexing large documents on SolrCloud
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

Best,
Erick

On Wed, Sep 10, 2014 at 9:11 AM, Olivier olivau...@gmail.com wrote:

Hi,

I have some problems indexing large documents in a SolrCloud cluster of 3 servers (Solr 4.8.1) with 3 shards and 2 replicas per shard, on Tomcat 7. One specific document (with 300K values in a multivalued field) could not be indexed on SolrCloud, but I could index it on a single Solr instance on my own PC. Indexing is done with Solarium from a database. The indexed data are e-commerce products with classic fields like name, price, description, instock, etc. The large field (type int) consists of the ids of other products. The only difference from the other documents that index fine is the size of that multivalued field: those documents all have between 100K and 200K values for that field. The index size is 11 MB for 20 documents.

To solve it, I tried to change several parameters, including the ZooKeeper timeouts in solr.xml.

In the solrcloud section:
    <int name="zkClientTimeout">6</int>
    <int name="distribUpdateConnTimeout">10</int>
    <int name="distribUpdateSoTimeout">10</int>

In the shardHandlerFactory section:
    <int name="socketTimeout">${socketTimeout:10}</int>
    <int name="connTimeout">${connTimeout:10}</int>

I also tried to increase these values in solrconfig.xml:
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="1" formdataUploadLimitInKB="10" addHttpRequestToContext="false"/>

I also tried to increase the amount of RAM (these are VMs): each server has 4 GB of RAM, with 3 GB for the JVM. Are there any other settings that could solve the problem and that I may have forgotten?

The error messages are:
ERROR SolrDispatchFilter null:java.lang.RuntimeException: [was class java.net.SocketException] Connection reset
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrDispatchFilter null:ClientAbortException: java.net.SocketException: broken pipe
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier
ERROR SolrCore org.apache.solr.common.SolrException: Unexpected EOF in attribute value
SolrCore org.apache.solr.common.SolrException: Unexpected end of input block in start tag

Thanks,
Olivier
Re: Problems for indexing large documents on SolrCloud
On 9/10/2014 2:05 PM, Erick Erickson wrote:
bq: org.apache.solr.common.SolrException: Unexpected end of input block; expected an identifier

This is very often an indication that your packets are being truncated by something in the chain. In your case, make sure that Tomcat is configured to handle inputs of the size that you're sending. This may be happening before things get to Solr, in which case your settings in solrconfig.xml aren't germane; the problem is earlier than that. A semi-smoking-gun here is that there's a size of your multivalued field that seems to break things... That doesn't rule out time problems, of course. But I'd look at the Tomcat settings for maximum packet size first.

The maximum HTTP request size is actually controlled by Solr itself since 4.1, with changes committed for SOLR-4265. Changing the setting on Tomcat probably will not help. An example from my own config which sets this to 32MB (the default is 2048, i.e. 2MB):

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="32768" formdataUploadLimitInKB="32768"/>

Thanks,
Shawn
Re: Indexing large documents
Even the most non-structured data has to have some breakpoint. I've seen projects running Solr that indexed whole books as one document per chapter, plus a boosted synopsis document. The question here is how you need to search and match those docs.

alexei martchenko
Facebook http://www.facebook.com/alexeiramone | Linkedin http://br.linkedin.com/in/alexeimartchenko | Steam http://steamcommunity.com/id/alexeiramone | 4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone | Github https://github.com/alexeiramone | (11) 9 7613.0966

2014-03-18 23:52 GMT-03:00 Stephen Kottmann stephen_kottm...@h3biomedicine.com:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen
Re: Indexing large documents
Hi Stephen,

We regularly index documents in the range of 500KB-8GB on machines that have about 10GB devoted to Solr. In order to avoid OOMs on Solr versions prior to Solr 4.0, we use separate indexing machine(s) from the search server machine(s) and also set the termIndexInterval to 8 times the default of 128:

    <termIndexInterval>1024</termIndexInterval>

(See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for a description of the problem, although the solution we are using is different: termIndexInterval rather than termInfosDivisor.)

I would like to second Otis' suggestion that you consider breaking large documents into smaller sub-documents. We are currently not doing that, and we believe that relevance ranking is not working well at all. If you consider that most relevance ranking algorithms were designed, tested, and tuned on TREC newswire-size documents (average 300 words) or truncated web documents (average 1,000-3,000 words), it seems likely that they may not work well with book-size documents (average 100,000 words). Ranking algorithms that use IDF will be particularly affected.

We are currently investigating grouping and block-join options. Unfortunately, our data does not have good markup or metadata to allow splitting books by chapter. We have investigated indexing pages of books, but due to many issues including performance and scalability (we index the full text of 11 million books, and indexing at the page level would result in 3.3 billion Solr documents), we haven't arrived at a workable solution for our use case. At the moment the main bottleneck is memory use for faceting, but we intend to experiment with docValues to see if the increase in index size is worth the reduction in memory use.

Presently, block-join indexing does not implement scoring, although we hope that will change in the near future, and the relevance ranking for grouping will rank the group by its highest-ranking member. So if you split a book into chapters, it would rank the book by the highest-ranking chapter. This may be appropriate for your use case, as Otis suggested. In our use case this is sometimes appropriate, but we are investigating other methods of scoring the group based on a more flexible function of the scores of the members (i.e. scoring a book based on a function of the scores of its chapters).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

On Tue, Mar 18, 2014 at 11:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

I think you probably want to split giant documents because you / your users probably want to be able to find smaller sections of those big docs that are best matches to their queries. Imagine querying War and Peace: almost any regular word you query for will produce a match. Yes, you may want to enable field collapsing aka grouping. I've seen facet counts get messed up when grouping is turned on, but have not confirmed if this is a (known) bug or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann stephen_kottm...@h3biomedicine.com wrote:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?
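[Editor's note: Tom mentions grouping and block-join as ways to split a book into chapter-level documents. A minimal SolrJ sketch of the block-join (parent/child) indexing idea follows, assuming a recent 4.x SolrJ with SolrInputDocument.addChildDocument; the core name and field names are invented for illustration.]

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BookChapterBlockJoin {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/books");

            // Parent document: the book itself (metadata only).
            SolrInputDocument book = new SolrInputDocument();
            book.addField("id", "book-1");
            book.addField("type", "book");
            book.addField("title", "Example Book");

            // Child documents: one per chapter, each small enough to rank sensibly.
            for (int i = 1; i <= 3; i++) {
                SolrInputDocument chapter = new SolrInputDocument();
                chapter.addField("id", "book-1-chapter-" + i);
                chapter.addField("type", "chapter");
                chapter.addField("text", "full text of chapter " + i);
                book.addChildDocument(chapter); // indexed as one block with the parent
            }

            server.add(book);
            server.commit();
            server.shutdown();

            // At query time, a block-join parent query such as
            //   q={!parent which="type:book"}text:whale
            // returns the matching books rather than the individual chapters.
        }
    }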
Indexing large documents
Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen

--
[This e-mail message may contain privileged, confidential and/or proprietary information of H3 Biomedicine. If you believe that it has been sent to you in error, please contact the sender immediately and delete the message including any attachments, without copying, using, or distributing any of the information contained therein. This e-mail message should not be interpreted to include a digital or electronic signature that can be used to authenticate an agreement, contract or other legal document, nor to reflect an intention to be bound to any legally-binding agreement or contract.]
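[Editor's note: Stephen describes pulling documents, extracting text with Tika, and pushing to Solr with SolrJ. A minimal sketch of that pipeline is below, assuming the Tika facade class and SolrJ 4.x; the URL, file path, and field names are illustrative only.]

    import java.io.File;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class TikaToSolr {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/docs");

            Tika tika = new Tika();
            tika.setMaxStringLength(-1); // the Tika facade truncates to ~100k chars by default

            File file = new File("/path/to/huge-document.pdf"); // placeholder path

            // Extracting a 1-2 GB source file yields a correspondingly huge String,
            // which is typically where the indexing JVM runs out of heap.
            String text = tika.parseToString(file);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getName());
            doc.addField("content", text);

            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }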
Re: Indexing large documents
Hi,

I think you probably want to split giant documents because you / your users probably want to be able to find smaller sections of those big docs that are best matches to their queries. Imagine querying War and Peace: almost any regular word you query for will produce a match. Yes, you may want to enable field collapsing aka grouping. I've seen facet counts get messed up when grouping is turned on, but have not confirmed if this is a (known) bug or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann stephen_kottm...@h3biomedicine.com wrote:

Hi Solr Users,

I'm looking for advice on best practices when indexing large documents (100's of MB or even 1 to 2 GB text files). I've been hunting around on Google and the mailing list, and have found some suggestions of splitting the logical document up into multiple Solr documents. However, I haven't been able to find anything that seems like conclusive advice.

Some background... We've been using Solr with great success for some time on a project that is mostly indexing very structured data, i.e. mainly based on ingesting through DIH. I've now started a new project and we're trying to make use of Solr again; however, in this project we are indexing mostly unstructured data: PDFs, PowerPoint, Word, etc. I've not done much configuration; my Solr instance is very close to the example provided in the distribution, aside from some minor schema changes. Our index is relatively small at this point (~3k documents), and for initial indexing I am pulling documents from an HTTP data source, running them through Tika, and then pushing to Solr using SolrJ. For the most part this is working great... until I hit one of these huge text files and then OOM on indexing.

I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at it, but it seems like maybe there's a more robust solution that would scale better.

Is splitting the logical document into multiple Solr documents best practice here? If so, what are the considerations or pitfalls of doing this that I should be paying attention to? I guess when querying I always need to use a group-by field to prevent multiple hits for the same document. Are there issues with term frequency, etc. that you need to work around?

Really interested to hear how others are dealing with this. Thanks everyone!

Stephen
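[Editor's note: Otis suggests grouping (field collapsing) so chunked documents don't flood results. A minimal SolrJ sketch of a grouped query follows; the core name and the source_doc_id field are assumptions for illustration, not from the thread.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupedSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/docs");

            SolrQuery q = new SolrQuery("peace");
            q.set("group", true);
            q.set("group.field", "source_doc_id"); // field tying each chunk back to its source document
            q.set("group.limit", 3);               // best few chunks per source document

            QueryResponse rsp = server.query(q);
            // One group per source document, so a single large file
            // no longer floods the result list with near-duplicate hits.
            System.out.println(rsp.getGroupResponse().getValues());

            server.shutdown();
        }
    }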
Re: Is indexing large documents still an issue?
You can still use highlighting without returning the content. Just set content as your alternate highlight field. Then, if no highlights are returned, you will receive the content. Make sure you set a character limit so you don't get the whole thing; I use 300. Does that make sense?

This is what I add to my query string:
hl=true&hl.fl=content&hl.snippets=3&hl.alternateField=content&hl.maxAlternateFieldLength=300

On Thu, May 2, 2013 at 7:32 AM, adfel70 adfe...@gmail.com wrote:

Well, returning the content field for highlighting is within my requirements. Did you solve this in some other way, or did you just not have to?

Bai Shen wrote:
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
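[Editor's note: the same highlighting parameters expressed through SolrJ, as a rough sketch; the field names (id, title, content) mirror the ones used in this thread, and the query term is arbitrary.]

    import org.apache.solr.client.solrj.SolrQuery;

    public class HighlightWithoutContent {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("foobar");
            q.setFields("id", "title");            // don't return the huge content field itself

            q.setHighlight(true);
            q.addHighlightField("content");
            q.setHighlightSnippets(3);
            // Fall back to the first 300 characters of content when no snippet matches.
            q.set("hl.alternateField", "content");
            q.set("hl.maxAlternateFieldLength", 300);

            System.out.println(q); // prints the equivalent URL query parameters
        }
    }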
Is indexing large documents still an issue?
Hi,

In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?

thanks.
Re: Is indexing large documents still an issue?
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 adfe...@gmail.com wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
Re: Is indexing large documents still an issue?
Well, returning the content field for highlighting is within my requirements. Did you solve this in some other way, or did you just not have to?

Bai Shen wrote:
The only issue I ran into was returning the content field. Once I modified my query to avoid that, I got good performance. Admittedly, I only have about 15-20k documents in my index ATM, but most of them are in the multi-MB range with a current max of 250MB.

On Thu, May 2, 2013 at 7:05 AM, adfel70 wrote:

Hi,
In previous versions of Solr, indexing documents with large fields caused performance degradation. Is this still the case in Solr 4.2? If so, and I'll need to chunk the document and index many document parts, can anyone give a general idea of what field/document size Solr CAN handle?
thanks.
Indexing large documents
Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
RE: Indexing large documents
Hi,

I want to know how to update my .xml file, which has fields other than the default ones. Which file do I have to modify, and how?

Praveen Jain
+919890599250

-----Original Message-----
From: Fouad Mardini [mailto:[EMAIL PROTECTED]
Sent: Monday, August 20, 2007 4:00 PM
To: solr-user@lucene.apache.org
Subject: Indexing large documents

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
That should show some errors if something goes wrong; if not, the console usually will. The errors will look like a Java stack trace. Did increasing the heap do anything for you? Changing mine to a 256MB max worked fine for all of our files.

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini
Re: Indexing large documents
You will probably need to increase the value of maxFieldLength in your solrconfig.xml. The default value is 10000 tokens, which might explain why your documents are not being completely indexed.

Piete

On 20/08/07, Peter Manis [EMAIL PROTECTED] wrote:

That should show some errors if something goes wrong; if not, the console usually will. The errors will look like a Java stack trace. Did increasing the heap do anything for you? Changing mine to a 256MB max worked fine for all of our files.

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Well, I am using the Java textmining library to extract text from documents, then I do a POST to Solr. I do not have an error log; I only have *.request.log files in the logs directory.

Thanks

On 8/20/07, Peter Manis [EMAIL PROTECTED] wrote:

Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini [EMAIL PROTECTED] wrote:

Hello,

I am using Solr to index text extracted from Word documents, and it is working really well. Recently I started noticing that some documents are not indexed; that is, I know that the word foobar is in a document, but when I search for foobar the id of that document is not returned. I suspect that this has to do with the size of the document, and that documents with a lot of text are not being indexed.

Please advise.

thanks,
fmardini