Wow, if all you want is to retrieve by ID, a database would be fine, even a NoSQL database.


 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



----- Original Message ----
From: Erik Fäßler <erik.faess...@uni-jena.de>
To: solr-user@lucene.apache.org
Sent: Tue, November 16, 2010 12:33:28 AM
Subject: DIH full-import failure, no real error message

Hey all,

I'm trying to create a Solr index for the 2010 Medline baseline (www.pubmed.gov, over 18 million XML documents). My goal is to be able to retrieve single XML documents by their ID. Each document comes with a unique ID, the PubMedID. So my schema (the important portions) looks like this:

<field name="pmid" type="string" indexed="true" stored="true" required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>

<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>

pmid holds the ID, date holds the creation date; xml holds the whole XML document (mostly below 5 KB). I used the DataImporter to do this. I had to write some classes (DataSource, EntityProcessor, DateFormatter) myself, so theoretically the error could lie there.

What happens is that indexing looks just fine at the beginning. Memory usage stays well below the maximum (max of 20 GB, usage below 5 GB, most of the time around 3 GB). It goes on for several hours in this manner until it suddenly stops. I tried this a few times with minor tweaks, none of which made any difference. The last time such a crash occurred, over 16.5 million documents had already been indexed (argh, so close...). It never stops at the same document, and indexing the documents where the error occurred again just runs fine. Index size on disk was between 40 GB and 50 GB the last time I looked.

This is the log from beginning to end:

(I decided to just attach the log for the sake of readability ;) ).

As you can see, Solr's error message is not quite complete. There are no closing brackets; the document is cut in half in this message, and not even the error message itself is complete: the 'D' of (D)ataImporter.runCmd(DataImporter.java:389) right after the document text is missing.

I have one thought concerning this: I get the input documents as an InputStream which I read buffer-wise (at most 1000 bytes per read() call). I need to deliver the documents in one large byte array to the XML parser I use (VTD-XML).
But I don't receive the individual small XML documents on their own; I always get one larger XML blob containing exactly 30,000 of these documents. I use a self-written EntityProcessor to extract the single documents from the larger blob. These blobs have a size of about 50 to 150 MB. So what I do is read these large blobs in 1000-byte steps and store each byte array in an ArrayList<byte[]>. Afterwards, I create the final byte[] and System.arraycopy each entry of the ArrayList into it.
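In code, the accumulation step looks roughly like the sketch below (class and method names are illustrative, not from the actual implementation). One classic pitfall in exactly this pattern is that read() may return fewer than 1000 bytes, so copying the full buffer each time injects stale bytes into the blob; a ByteArrayOutputStream handles that bookkeeping for you:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch of the chunked read described above: consume the
// stream in reads of at most 1000 bytes and build one byte[] for the parser.
public class BlobReader {
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1000];
        int n;
        while ((n = in.read(buf)) != -1) {
            // read() may fill less than buf.length; copy only the n bytes
            // actually read, never the whole buffer.
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

If the ArrayList<byte[]> version ever copies a full 1000-byte chunk when read() returned less, stray or missing bytes end up in the assembled blob, which would be consistent with a document appearing cut in half.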
I tested this and it looks fine to me. And as I said, indexing the documents where the error occurred works just fine on a retry (that is, indexing the whole blob containing the single document). I only mention this because it looks like there is a cut in the document, and the missing 'D' reminds me of char-encoding errors. But I don't know for sure; opening the error log in vi doesn't show any broken characters (the last time I had such problems, vi could identify the characters in question while other editors just wouldn't show them).

Further ideas from my side: is the index too big? I think I read somewhere that a large index would be around 10 million documents, and I aim to roughly double that number. But would this cause such an error? In the end: what exactly IS the error?

Sorry for the wall of text; I'm just trying to describe the problem in as much detail as possible. Thanks a lot for reading, and I appreciate any ideas! :)

Best regards,

    Erik
