Retrieval by ID is only one possible case; I'm still at the beginning of the project and expect to add more fields for more complicated queries in the future. I imagine a WHERE ... LIKE query over all the XML documents stored in a DBMS wouldn't be too performant ;)

And at a later stage I will process all these documents and add lots of metadata - by then at the latest, I will need a Lucene index rather than a database. So I'd be interested in ideas for solving my issue all the same.

Regards,

    Erik

On 16.11.2010 11:35, Dennis Gearon wrote:
Wow, if all you want is to retrieve by ID, a database would be fine, even a NoSQL database.


  Dennis Gearon



----- Original Message ----
From: Erik Fäßler <erik.faess...@uni-jena.de>
To: solr-user@lucene.apache.org
Sent: Tue, November 16, 2010 12:33:28 AM
Subject: DIH full-import failure, no real error message

Hey all,

I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov,
over 18 million XML documents). My goal is to be able to retrieve single XML
documents by their ID. Each document comes with a unique ID, the PubMed ID. So my
schema (the important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>
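
In case it helps to see the intended access pattern: retrieval is a plain
uniqueKey lookup. A minimal SolrJ sketch (core URL and PMID are placeholders,
not my actual setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PmidLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL; adjust to the actual installation.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // pmid is the uniqueKey, so this matches at most one document.
        QueryResponse response = server.query(new SolrQuery("pmid:12345678"));

        for (SolrDocument doc : response.getResults()) {
            // The stored xml field contains the complete citation record.
            System.out.println(doc.getFieldValue("xml"));
        }
    }
}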

In this schema, pmid holds the ID, date holds the creation date, and xml holds the
whole XML document (mostly below 5kb). I used the DataImporter to do this. I had
to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so
theoretically, the error could lie there.
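
Roughly, such a custom EntityProcessor has this shape (a simplified sketch, not
my actual code; the extract* helpers stand in for the real VTD-XML parsing):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class MedlineEntityProcessor extends EntityProcessorBase {

    @Override
    public void init(Context context) {
        super.init(context);
        // set up access to the blob via the custom DataSource here
    }

    @Override
    public Map<String, Object> nextRow() {
        // Stand-in: pull the next citation out of the 30,000-document blob.
        String documentXml = extractNextDocument();
        if (documentXml == null) {
            return null; // null tells the DIH this entity is exhausted
        }
        Map<String, Object> row = new HashMap<String, Object>();
        row.put("pmid", extractPmid(documentXml)); // stand-in helper
        row.put("xml", documentXml);
        return row;
    }

    private String extractNextDocument() { /* VTD-XML parsing */ return null; }
    private String extractPmid(String xml) { /* VTD-XML parsing */ return null; }
}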

What happens is that indexing looks just fine at the beginning. Memory usage stays
well below the maximum (max of 20g, usage below 5g, most of the time around
3g). It goes on in this manner for several hours until it suddenly stops. I tried
this a few times with minor tweaks, none of which made any difference. The last
time such a crash occurred, over 16.5 million documents had already been indexed
(argh, so close...). It never stops at the same document, and re-indexing the
documents where the error occurred runs just fine. Index size on disk was
between 40g and 50g the last time I looked.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no closing
brackets. The document is cut in half in this message, and not even the error
message itself is complete: the 'D' of
(D)ataImporter.runCmd(DataImporter.java:389) right after the document text is
missing.

I have one thought concerning this: I get the input documents as an InputStream
which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver
the documents in one large byte array to the XML parser I use (VTD-XML).
But I don't get the individual small XML documents directly; I always get one
larger XML blob containing exactly 30,000 of these documents. I use a self-written
EntityProcessor to extract the single documents from the larger blob. These
blobs have a size of about 50 to 150mb. So what I do is read these large
blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>.
Afterwards, I create the final byte[] and System.arraycopy the chunks from the
ArrayList into it.
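
In code, the assembly looks roughly like this (a simplified sketch, not the
actual class; the one subtlety is that read() may return fewer than 1000 bytes,
so only the bytes actually read must be kept):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class BlobReader {

    // Reads the whole blob into one byte[] in steps of at most 1000 bytes.
    public static byte[] readAll(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<byte[]>();
        int total = 0;
        byte[] buffer = new byte[1000];
        int n;
        while ((n = in.read(buffer)) != -1) {
            // read() may fill less than the buffer; keep only the n bytes
            // actually read, or the final array would contain stale bytes.
            byte[] chunk = new byte[n];
            System.arraycopy(buffer, 0, chunk, 0, n);
            chunks.add(chunk);
            total += n;
        }
        // Assemble the single large array for the XML parser.
        byte[] result = new byte[total];
        int offset = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, result, offset, chunk.length);
            offset += chunk.length;
        }
        return result;
    }
}

(java.io.ByteArrayOutputStream would do the same job with less bookkeeping.)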
I tested this and it looks fine to me. And as I said, indexing the documents
where the error occurred works just fine (that is, indexing the whole blob
containing the single document). I only mention this because it kind of looks
like there is this cut in the document, and the missing 'D' reminds me of
char-encoding errors. But I can't say for sure; opening the error log in vi
doesn't show any broken characters (the last time I had such problems, vi could
identify the characters in question where other editors just wouldn't show them).

Further ideas from my side: Is the index too big? I think I read that a large
index is somewhere around 10 million documents, and I aim to approximately
double that number. But would that cause such an error? In the end: what
exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in as much
detail as possible. Thanks a lot for reading, and I appreciate any ideas! :)

Best regards,

     Erik

