Hi again,
I'm still working on the idmapping triple store that I mentioned in my
previous request. This time I'd like some advice on the way I'm populating the
triple store. I'm doing this with the Jena API in a Java program that I'm
writing. First off, the data I'm consuming is a tab-separated value file of
around 116M lines, in the following style:
Q6GZX4 001R_FRG3G 2947773 YP_031579.1 81941549; 49237298 GO:0006355;
GO:0046782; GO:0006351 UniRef100_Q6GZX4 UniRef90_Q6GZX4 UniRef50_Q6GZX4
UPI00003B0FD4 654924 15165820 AY548484 AAT09660.1
The two IDs I'm interested in are the first column, which is the UniProt ID,
and the third column, which is an Entrez ID. The third column is sometimes
blank and sometimes contains several values, in the format "12345; 12346".
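To make the column layout concrete, here is a minimal, Jena-free sketch of how
one line could be split into the UniProt ID and its Entrez IDs. The class and
method names (IdMappingLine, uniprotId, entrezIds) are made up for
illustration and are not part of the loader itself. The sketch assumes that
columns are separated by single tab characters and that blank columns are
kept as empty strings.

```java
import java.util.ArrayList;
import java.util.List;

public class IdMappingLine {

    /** Returns the first tab-separated column (the UniProt ID). */
    public static String uniprotId(String line) {
        return line.split("\t", -1)[0];
    }

    /**
     * Returns the Entrez IDs found in the third tab-separated column.
     * The list is empty when the column is missing or blank.
     */
    public static List<String> entrezIds(String line) {
        // A limit of -1 keeps trailing empty columns, which split() drops by default.
        String[] columns = line.split("\t", -1);
        List<String> ids = new ArrayList<>();
        if (columns.length < 3) {
            return ids;
        }
        // Multiple values look like "12345; 12346"; trim to tolerate the space.
        for (String id : columns[2].split(";")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                ids.add(trimmed);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String line = "Q6GZX4\t001R_FRG3G\t2947773\tYP_031579.1";
        System.out.println(uniprotId(line) + " -> " + entrezIds(line));
    }
}
```

Checking the column count first also covers short lines, where indexing the
third column directly would throw an ArrayIndexOutOfBoundsException.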
Below is the Java code that I have written to put this information into a
dataset.
public void doWithDataset(Consumer<Dataset> function) {
    ARQ.init();
    Dataset ds = TDBFactory.createDataset(directory);
    ds.begin(ReadWrite.WRITE);
    function.accept(ds);
    ds.commit();
    ds.end();
}

private void loadUniprotFile(InputStream uniprotData, AtomicInteger seed) {
    doWithDataset(ds -> {
        try {
            BufferedReader bReader = new BufferedReader(new InputStreamReader(uniprotData));
            int numberOfTriples = 0;
            int modelNumber = 0;
            Model model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));
            String line = bReader.readLine();
            for (int i = 0; line != null; i++) {
                String[] content = line.split("\t");
                if (!content[2].isEmpty()) {
                    String[] entrezIds = content[2].split("; ");
                    String uniprotId = content[0];
                    for (String entrezId : entrezIds) {
                        numberOfTriples += 2;
                        addEntrezUniprotLinkToModel(model, seed, uniprotId, entrezId);
                    }
                }
                line = bReader.readLine();
                if (numberOfTriples % LINES_PER_TRANSACTION == 0 && i != 0) {
                    ds.commit();
                    ds.end();
                    ds.begin(ReadWrite.WRITE);
                    modelNumber++;
                    model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));
                }
                if (i % 10000 == 0) {
                    System.out.println("Protein Line:" + i);
                }
            }
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        }
    });
}

private static void addEntrezUniprotLinkToModel(Model model, AtomicInteger seed,
        String uniprotId, String entrezId) {
    String uniprotUri = UNIPROT_URI + uniprotId;
    String entrezUri = NCBI_URI + entrezId;
    String internalId = GENE_DATA_URI + seed.getAndIncrement();
    Resource resource = model.createResource(internalId);
    resource.addProperty(model.getProperty(ID_MAPPING), uniprotUri);
    resource.addProperty(model.getProperty(ID_MAPPING), entrezUri);
}
I estimate that this file should result in around 22M triples being created.
The last time this was run, it took around a week to complete, with
LINES_PER_TRANSACTION set to 1000. I've tweaked this number and got somewhat
faster results before running into memory errors. Is this simply a case of
finding a sweet spot for this value, or are there other changes you think
could improve what I've done?
While monitoring the run with VisualVM, I also noticed that a lot of CPU time
was spent in the BlockAccessMapped.flushDirtySegments() method, increasingly
so as the program progressed.
Thanks for your time,
Adam