Hi Dave,

A couple more thoughts -

Security is separate from file size.
Maybe assign your users to membership classes, which will cut down the
amount of updating needed over time as people enter, leave, or change
roles. For instance, 'Bob' was in operations with full access, moved to
tech support with access restricted to one major customer's content,
then moved to supporting that customer's direct competitor, with no
access to the first customer's content and full access to the
competitor's. Ideally, changing Bob's permissions can be done without
touching the index. There are commercial products for this sort of
thing; Netegrity SiteMinder is one, and it had the largest market share.
Maybe read how they handle it and, based on your resources, build or buy
what you need. The biggest piece of work (not that big) is the
integration with the permissions/Single-Sign-On system - LDAP or
otherwise. The indexed docs themselves just carry a security token.
Maybe read up on Access Control Lists if you haven't worked with them
before.
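
To make that concrete, here's a rough query-time sketch (my 'acl_groups'
field name and the group list are placeholders - adjust to your schema
and your SSO/LDAP lookup). Each indexed doc carries the groups allowed
to see it, and every search gets a filter query built from the
searcher's groups:

import java.util.Arrays;
import java.util.List;

public class AclFilter {

    /** Build a Solr fq clause restricting results to the searcher's groups. */
    static String buildAclFilter(List<String> userGroups) {
        StringBuilder fq = new StringBuilder("acl_groups:(");
        for (int i = 0; i < userGroups.size(); i++) {
            if (i > 0) fq.append(" OR ");
            // Quote each group so names with spaces or colons stay one token.
            fq.append('"').append(userGroups.get(i).replace("\"", "\\\"")).append('"');
        }
        return fq.append(')').toString();
    }

    public static void main(String[] args) {
        // The groups would come from your LDAP/SSO lookup for the current user.
        List<String> groups = Arrays.asList("operations", "techsupport-acme");
        System.out.println(buildAclFilter(groups));
        // prints: acl_groups:("operations" OR "techsupport-acme")
        // Append that as an fq parameter on every search; when Bob changes
        // roles you update his group membership, not the index.
    }
}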

Breaking up big files can be done blindly during index loading, with a
scheme that clues users or the UI in on how to reach the other sections.
Document-to-text conversion should happen in the indexing pipeline.
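
Something like this is what I mean by blind chunking (just a sketch -
the chunk size, id scheme, and field names are made up). Each section
becomes its own Solr doc with a parent_id and chunk_no so the UI can
offer "next/previous section":

import java.util.ArrayList;
import java.util.List;

public class Chunker {

    static final int CHUNK_CHARS = 100_000;  // roughly 100 KB of text per chunk

    /** Split already-extracted plain text into index-sized sections. */
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += CHUNK_CHARS) {
            chunks.add(text.substring(start, Math.min(text.length(), start + CHUNK_CHARS)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String docId = "fs:/reports/annual-2007.pdf";  // hypothetical parent id
        String extractedText = "...output of the doc-to-text conversion step...";
        List<String> sections = chunk(extractedText);
        for (int i = 0; i < sections.size(); i++) {
            // One Solr <doc> per section: id = parent id + chunk number.
            System.out.printf("id=%s#%d parent_id=%s chunk_no=%d chars=%d%n",
                    docId, i, docId, i, sections.get(i).length());
        }
    }
}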

Another approach could be to index a summary and point to the large doc
in the file system or database.
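
And to make the summary-plus-pointer idea concrete, a minimal sketch
(the 'summary' and 'content_url' field names are made up, and the XML
escaping is bare-bones) - index a short abstract plus a locator, and
leave the full doc in the file system or database:

public class SummaryDoc {

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Build the <add><doc> XML for one summary-only document. */
    static String addDoc(String id, String summary, String contentUrl) {
        return "<add><doc>"
             + "<field name=\"id\">" + escape(id) + "</field>"
             + "<field name=\"summary\">" + escape(summary) + "</field>"
             + "<field name=\"content_url\">" + escape(contentUrl) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        // Post this to the update handler (e.g. with SimplePostTool or curl);
        // queries match the small summary field, and the UI follows
        // content_url to fetch the full document.
        System.out.println(addDoc("fs:/reports/annual-2007.pdf",
                "First couple of KB, or a generated abstract, of the report...",
                "http://fileserver/reports/annual-2007.pdf"));
    }
}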

Cheers,
Jon

-----Original Message-----
From: David Thibault [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 23, 2008 9:50 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing very large files.

Thanks.  I'm trying to build a general-purpose secure enterprise search
system.  Specifically, it needs to be able to crawl web pages (which are
almost all small files) and filesystems (which may have widely varying file
sizes).  I realize other projects have done something similar, but none take
the original file permissions into account, index those too, and then limit
search results to documents the searching party should have access to
(hiding results the searcher should not see).  Since the types of files are
not known in advance, I can't exactly split them up into logical units.  I
could possibly just limit my indexing to the first X MB of any file, though.
I hadn't thought of the implications for relevance or post-processing that
you bring up above.
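
For example, capping extraction at the first X MB could look something
like this (a rough, untested sketch of my own; the 8 MB limit is
arbitrary, and truncation could split a multi-byte character at the very
end):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TruncatingReader {

    static final int MAX_BYTES = 8 * 1024 * 1024;  // assumed cap: 8 MB

    /** Read at most MAX_BYTES from the file and return it as text. */
    static String readHead(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[MAX_BYTES];
            int total = 0;
            int n;
            while (total < MAX_BYTES
                    && (n = in.read(buf, total, MAX_BYTES - total)) != -1) {
                total += n;
            }
            // Whatever was read becomes the field value sent to Solr.
            return new String(buf, 0, total, StandardCharsets.UTF_8);
        }
    }
}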
Thanks,
Dave

On 2/23/08, Jon Lehto <[EMAIL PROTECTED]> wrote:
>
> Dave
>
> You may want to break large docs into chunks, say by chapter or other
> logical segment.
>
> This will help in
>   - relevance ranking - the term frequency of large docs will cause
>     uneven weighting unless the relevance calculation does log
>     normalization
>   - finer granularity of retrieval - for example, a dictionary, a
>     thesaurus, and an encyclopedia probably have what you want, but how
>     do you get it quickly?
>   - post-processing - things like highlighting can be a performance
>     killer, as the search/replace scans the entire large file for
>     matching strings
>
>
> Jon
>
>
> -----Original Message-----
> From: David Thibault [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 21, 2008 7:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing very large files.
>
> All,
> A while back I was running into an issue with a Java heap out of memory
> error while indexing large files.  I figured out that was my own error due
> to a misconfiguration of my Netbeans memory settings.
>
> However, now that that is fixed, I have stumbled upon a new error.  When
> trying to upload files that include a Solr TextField value of 32 MB or
> more, I get the following error (uploading with SimplePostTool):
>
>
> Solr returned an error: error reading input, returned 0
> javax.xml.stream.XMLStreamException: error reading input, returned 0
>   at com.bea.xml.stream.MXParser.fillBuf(MXParser.java:3709)
>   at com.bea.xml.stream.MXParser.more(MXParser.java:3715)
>   at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1936)
>   at com.bea.xml.stream.MXParser.next(MXParser.java:1333)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:902)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:280)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:237)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>   at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>   at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>   at java.lang.Thread.run(Thread.java:613)
>
> I suspect there's a setting somewhere that I'm overlooking that is causing
> this, but after peering through the solrconfig.xml and schema.xml files I
> am not seeing anything obvious (to me, anyway...=).  The second line of
> the error shows it's crashing in MXParser.fillBuf, which implies that I'm
> overloading the buffer (I assume due to too large of a string).
>
> Thanks in advance for any assistance,
> Dave
>
>
