Hey all,
I'm trying to create a Solr index for the 2010 Medline baseline
(www.pubmed.gov, over 18 million XML documents). My goal is to be able
to retrieve single XML documents by their ID. Each document comes with a
unique ID, the PubMed ID (PMID). So my schema (important portions) looks
like this:
<field name="pmid" type="string" indexed="true" stored="true"
required="true" />
<field name="date" type="tdate" indexed="true" stored="true"/>
<field name="xml" type="text" indexed="true" stored="true"/>
<uniqueKey>pmid</uniqueKey>
<defaultSearchField>pmid</defaultSearchField>
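(With this, retrieving a citation is just a lookup on the unique key;
assuming a default single-core setup, the query would be something like

http://localhost:8983/solr/select?q=pmid:8817856&fl=xml

to get back the stored xml field for that document.)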
pmid holds the ID, date holds the creation date, and xml holds the whole
XML document (mostly below 5 kB). I used the DataImportHandler to do the
indexing. I had to write some classes (DataSource, EntityProcessor,
DateFormatter) myself, so theoretically the error could lie there.
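To give you an idea of its shape: the EntityProcessor is basically a
subclass of DIH's EntityProcessorBase that emits one row per citation.
This is a heavily simplified sketch, not my actual code; the helper
methods just stand in for the real extraction logic:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class MedlineEntityProcessor extends EntityProcessorBase {
    @Override
    public Map<String, Object> nextRow() {
        // Pull the next <MedlineCitation> out of the input.
        String citation = nextCitationXml();
        if (citation == null) {
            return null; // null tells DIH this entity is exhausted
        }
        Map<String, Object> row = new HashMap<String, Object>();
        row.put("pmid", extractPmid(citation));
        row.put("date", extractDateCreated(citation));
        row.put("xml", citation); // whole citation goes into the xml field
        return row;
    }

    // Placeholders for the real parsing code:
    private String nextCitationXml() { return null; }
    private String extractPmid(String xml) { return null; }
    private String extractDateCreated(String xml) { return null; }
}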
What happens is that indexing looks just fine at the beginning. Memory
usage stays well below the maximum (heap max of 20 GB, usage below
5 GB, most of the time around 3 GB). It goes on like this for several
hours until it suddenly stops. I tried this a few times with minor
tweaks, none of which made any difference. The last time such a crash
occurred, over 16.5 million documents had already been indexed (argh, so
close...). It never stops at the same document, and trying to index the
documents where the error occurred again just runs fine. Index size on
disk was between 40 GB and 50 GB the last time I had a look.
This is the log from beginning to end:
(I decided to just attach the log for the sake of readability ;) ).
As you can see, Solr's error message is not quite complete. There are no
closing brackets: the document is cut off in the middle of this message,
and not even the error message itself is complete. The 'D' of
(D)ataImporter.runCmd(DataImporter.java:389), right after the document
text, is missing.
I have one thought concerning this: I get the input documents as an
InputStream, which I read buffer-wise (at most 1000 bytes per read()
call). I need to deliver the documents in one large byte array to the
XML parser I use (VTD-XML).
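Handing the document over uses the standard VTD-XML API (exception
handling omitted; wholeBlob is just an illustrative name for the single
large byte[]):

import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

VTDGen vg = new VTDGen();
vg.setDoc(wholeBlob);     // VTD-XML takes the document as one byte[]
vg.parse(true);           // namespace-aware parse
VTDNav nav = vg.getNav(); // navigate/extract the citations from here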
But I don't get the individual small XML documents on their own; I
always get one larger XML blob containing exactly 30,000 of these
documents. These blobs have a size of about 50 to 150 MB. I use a
self-written EntityProcessor to extract the single documents from the
larger blob. So what I do is read these large blobs in 1000-byte steps
and store each chunk in an ArrayList<byte[]>. Afterwards, I create the
final byte[] and System.arraycopy each chunk from the ArrayList into it,
as in the sketch below.
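In simplified form, the accumulation looks like this (error handling
stripped; readBlob is just a name for the sketch):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Read an entire blob into one byte[], at most 1000 bytes per read().
static byte[] readBlob(InputStream in) throws IOException {
    List<byte[]> chunks = new ArrayList<byte[]>();
    int total = 0;
    while (true) {
        byte[] buf = new byte[1000];
        int n = in.read(buf);
        if (n < 0) {
            break; // end of stream
        }
        if (n < buf.length) {
            // read() may return fewer than 1000 bytes at any time,
            // so trim the chunk to the bytes actually read.
            byte[] exact = new byte[n];
            System.arraycopy(buf, 0, exact, 0, n);
            buf = exact;
        }
        chunks.add(buf);
        total += buf.length;
    }
    byte[] whole = new byte[total];
    int pos = 0;
    for (byte[] chunk : chunks) {
        System.arraycopy(chunk, 0, whole, pos, chunk.length);
        pos += chunk.length;
    }
    return whole; // this is what goes to VTDGen.setDoc()
}

Note the trimming of short reads: read() may return fewer than 1000
bytes at any point, not just at the end of the stream, so each chunk has
to be cut down to the bytes actually read before it is stored.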
I tested this and it looks fine to me. And as I said, indexing the
documents where the error occurred again just works fine (that is,
indexing the whole blob containing the single document). I only mention
this because the document looks cut off, and the missing 'D' reminds me
of character-encoding errors. But I don't know for sure; opening the
error log in vi doesn't show any broken characters (the last time I had
such problems, vi could identify the characters in question while other
editors just wouldn't show them).
A further idea from my side: is the index too big? I think I read
somewhere that a large index is something around 10 million documents,
and I'm aiming at roughly double that number. But would that cause such
an error? And in the end: what exactly IS the error?
Sorry for the wall of text; I'm just trying to describe the problem in
as much detail as possible. Thanks a lot for reading, and I appreciate
any ideas! :)
Best regards,
Erik
15.11.2010 11:08:22 org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1289465394071
15.11.2010 18:16:06 org.apache.solr.handler.dataimport.SolrWriter upload
WARNING: Error creating document : SolrInputDocument[{pmid=pmid(1.0)={8817856},
xml=xml(1.0)={<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID>8817856</PMID>
<DateCreated>
<Year>1996</Year>
<Month>12</Month>
<Day>04</Day>
</DateCreated>
<DateCompleted>
<Year>1996</Year>
<Month>12</Month>
<Day>04</Day>
</DateCompleted>
<DateRevised>
<Year>2004</Year>
<Month>11</Month>
<Day>17</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0042-4900</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>138</Volume>
<Issue>26</Issue>
<PubDate>
<Year>1996</Year>
<Month>Jun</Month>
<Day>29</Day>
</PubDate>
</JournalIssue>
<Title>The Veterinary record</Title>
<ISOAbbreviation>Vet. Rec.</ISOAbbreviation>
</Journal>
<ArticleTitle>Restoring confidence in beef: towards a European
solution.</ArticleTitle>
<Pagination>
<MedlinePgn>631-2</MedlinePgn>
</Pagination>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType>News</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>ENGLAND</Country>
<MedlineTA>Vet Rec</MedlineTA>
<NlmUniqueID>0031164</NlmUniqueID>
<ISSNLinking>0042-4900</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Cattle</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Commerce</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Encephalopathy, Bovine
Spongiform</DescriptorName>
<QualifierName MajorTopicYN="N">prevention & control</QualifierName>
<QualifierName MajorTopicYN="Y">transmission</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Food Contamination</DescriptorName>
<QualifierName MajorTopicYN="Y">prevention & control</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Great Britain</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName
MajorTopicYN="Y">Meat</DescriptorNamataImporter.runCmd(DataImporter.java:389)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
16.11.2010 03:28:16 org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback