Populating a TDB dataset through Java code.

Adam Ladly Thu, 21 Jun 2018 10:53:19 -0700

Hi again,

I'm still working on the idmapping triple store which I mentioned in my 
previous request. I wanted to get some advice on the way I'm populating the 
triple store this time around. I'm doing this using the Jena API in a Java 
program that I'm writing. First off, the data I'm consuming is in a 
tab-separated value file of around 116M lines, in the following style:


Q6GZX4 001R_FRG3G 2947773 YP_031579.1 81941549; 49237298 GO:0006355; 
GO:0046782; GO:0006351 UniRef100_Q6GZX4 UniRef90_Q6GZX4 UniRef50_Q6GZX4 
UPI00003B0FD4 654924 15165820 AY548484 AAT09660.1

The two ids I'm interested in are the first one, which is the Uniprot Id, 
and the third one, which is an Entrez Id. The third column can sometimes be 
blank, and can sometimes contain several values, in the format "12345; 12346".
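
The column handling described above can be sketched on its own, without Jena. This is only a minimal illustration: the class IdMappingLine and its methods are hypothetical names that do not appear in the program below, and it assumes that columns are separated by single tab characters and that multiple Entrez ids are separated by semicolons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the program below.
final class IdMappingLine {

    /** Returns the Uniprot id, taken from the first tab-separated column. */
    static String uniprotId(String line) {
        return line.split("\t", -1)[0];
    }

    /** Returns the Entrez ids from the third column, or an empty list if it is blank or missing. */
    static List<String> entrezIds(String line) {
        // A limit of -1 keeps empty columns, so a blank third column stays at index 2.
        String[] columns = line.split("\t", -1);
        List<String> ids = new ArrayList<>();
        if (columns.length < 3) {
            return ids;
        }
        // Split on ';' and trim, so both "12345;12346" and "12345; 12346" are handled.
        for (String id : columns[2].split(";")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                ids.add(trimmed);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String line = "Q6GZX4\t001R_FRG3G\t12345; 12346\tYP_031579.1";
        System.out.println(uniprotId(line) + " -> " + entrezIds(line));
    }
}
```

Checking the column count before indexing also covers truncated lines, which would otherwise throw an ArrayIndexOutOfBoundsException.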

Below is the Java code that I have written to put this information into a 
dataset.

public void doWithDataset(Consumer<Dataset> function) {
    ARQ.init();
    Dataset ds = TDBFactory.createDataset(directory);
    ds.begin(ReadWrite.WRITE);

    function.accept(ds);

    ds.commit();
    ds.end();
}

private void loadUniprotFile(InputStream uniprotData, AtomicInteger seed) {
    doWithDataset(ds -> {
        try {
            BufferedReader bReader = new BufferedReader(new InputStreamReader(uniprotData));

            int numberOfTriples = 0;
            int modelNumber = 0;
            Model model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));

            String line = bReader.readLine();
            for (int i = 0; line != null; i++) {
                String[] content = line.split("\t");
                if (!content[2].isEmpty()) {
                    String[] entrezIds = content[2].split("; ");
                    String uniprotId = content[0];
                    for (String entrezId : entrezIds) {
                        numberOfTriples += 2;
                        addEntrezUniprotLinkToModel(model, seed, uniprotId, entrezId);
                    }
                }
                line = bReader.readLine();

                if (numberOfTriples % LINES_PER_TRANSACTION == 0 && i != 0) {
                    ds.commit();
                    ds.end();
                    ds.begin(ReadWrite.WRITE);
                    modelNumber++;
                    model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));
                }

                if (i % 10000 == 0) {
                    System.out.println("Protein Line:" + i);
                }
            }
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        }
    });
}

private static void addEntrezUniprotLinkToModel(Model model, AtomicInteger seed,
        String uniprotId, String entrezId) {
    String uniprotUri = UNIPROT_URI + uniprotId;
    String entrezUri = NCBI_URI + entrezId;
    String internalId = GENE_DATA_URI + seed.getAndIncrement();
    Resource resource = model.createResource(internalId);
    resource.addProperty(model.getProperty(ID_MAPPING), uniprotUri);
    resource.addProperty(model.getProperty(ID_MAPPING), entrezUri);
}

I estimate that this file should result in around 22M triples being created. 
The last time this was run, it took around a week to complete, with the 
LINES_PER_TRANSACTION value set to 1000. I've tried adjusting this value to 
get faster results, but ran into memory errors. Is this simply a case of 
finding a sweet spot for this value, or are there other suggestions you think 
could improve what I've done?
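
For reference, the batching idea can be sketched independently of Jena. In the loop above, the check numberOfTriples % LINES_PER_TRANSACTION == 0 is also true on every line that adds no triples while the count sits on a multiple of the batch size, so such lines each trigger a commit and a new named model. The sketch below keeps a count of pending triples instead and resets it after each commit. The class TripleBatcher is a hypothetical name, and the Runnable is only a stand-in for whatever ends one write transaction and begins the next; it is not a Jena API.

```java
// Hypothetical helper for illustration only; the commit callback stands in
// for ending the current write transaction and beginning a new one.
final class TripleBatcher {
    private final int batchSize;
    private final Runnable commit;
    private int pending = 0;
    private int commits = 0;

    TripleBatcher(int batchSize, Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commit = commit;
    }

    /** Records newly added triples and commits once a full batch is pending. */
    void added(int triples) {
        pending += triples;
        if (pending >= batchSize) {
            flush();
        }
    }

    /** Commits whatever is still pending, for example at the end of the input. */
    void flush() {
        if (pending > 0) {
            commit.run();
            commits++;
            pending = 0;
        }
    }

    int commits() {
        return commits;
    }

    public static void main(String[] args) {
        TripleBatcher batcher = new TripleBatcher(1000, () -> System.out.println("commit"));
        for (int line = 0; line < 2500; line++) {
            // Every other line carries one id pair (2 triples); the rest add nothing.
            batcher.added(line % 2 == 0 ? 2 : 0);
        }
        batcher.flush();
        System.out.println("commits: " + batcher.commits());
    }
}
```

With this shape, lines that contribute no triples never cause a commit, and the batch size can be varied without changing the loop.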

While monitoring this with VisualVM, I noticed that a lot of CPU time was 
spent in the BlockAccessMapped.flushDirtySegments() method, increasingly so 
as the program progressed.

Thanks for your time,
Adam
