Hi again,

I'm still working on the idmapping triple store that I mentioned in my previous request. This time I'd like some advice on the way I'm populating the triple store. I'm doing this with the Jena API in a Java program that I'm writing.

First off, the data I'm consuming is a tab-separated values file of around 116M lines, in the following style:

    Q6GZX4 001R_FRG3G 2947773 YP_031579.1 81941549; 49237298 GO:0006355; GO:0046782; GO:0006351 UniRef100_Q6GZX4 UniRef90_Q6GZX4 UniRef50_Q6GZX4 UPI00003B0FD4 654924 15165820 AY548484 AAT09660.1

The two IDs I'm interested in are the first one, which is the UniProt ID, and the third one, which is an Entrez ID. The third column can sometimes be blank, and can sometimes contain several values in the format "12345; 12346".

Below is the Java code that I have written to put this information into a dataset.

    public void doWithDataset(Consumer<Dataset> function) {
        ARQ.init();
        Dataset ds = TDBFactory.createDataset(directory);
        ds.begin(ReadWrite.WRITE);
        function.accept(ds);
        ds.commit();
        ds.end();
    }

    private void loadUniprotFile(InputStream uniprotData, AtomicInteger seed) {
        doWithDataset(ds -> {
            try {
                BufferedReader bReader = new BufferedReader(new InputStreamReader(uniprotData));
                int numberOfTriples = 0;
                int modelNumber = 0;
                Model model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));
                String line = bReader.readLine();
                for (int i = 0; line != null; i++) {
                    String[] content = line.split("\t");
                    // Third column holds the Entrez ID(s); skip the line if it is blank.
                    if (!content[2].isEmpty()) {
                        String[] entrezIds = content[2].split("; ");
                        String uniprotId = content[0];
                        for (String entrezId : entrezIds) {
                            // Each link adds two triples (one per ID).
                            numberOfTriples += 2;
                            addEntrezUniprotLinkToModel(model, seed, uniprotId, entrezId);
                        }
                    }
                    line = bReader.readLine();
                    // Commit the current transaction and move on to a new named model.
                    if (numberOfTriples % LINES_PER_TRANSACTION == 0 && i != 0) {
                        ds.commit();
                        ds.end();
                        ds.begin(ReadWrite.WRITE);
                        modelNumber++;
                        model = addSchemaToModel(ds.getNamedModel(PROTEIN_MODEL + modelNumber));
                    }
                    if (i % 10000 == 0) {
                        System.out.println("Protein Line:" + i);
                    }
                }
            } catch (IOException ioe) {
                throw new RuntimeException(ioe);
            }
        });
    }

    private static void addEntrezUniprotLinkToModel(Model model, AtomicInteger seed, String uniprotId, String entrezId) {
        String uniprotUri = UNIPROT_URI + uniprotId;
        String entrezUri = NCBI_URI + entrezId;
        String internalId = GENE_DATA_URI + seed.getAndIncrement();
        Resource resource = model.createResource(internalId);
        resource.addProperty(model.getProperty(ID_MAPPING), uniprotUri);
        resource.addProperty(model.getProperty(ID_MAPPING), entrezUri);
    }

I estimate that this file should result in roughly 22M triples being created. The last time this was run, with LINES_PER_TRANSACTION set to 1000, it took around a week to complete. I've tweaked this number and got somewhat faster results, but eventually ran into memory errors.

Is this simply a case of finding a sweet spot for this value, or are there other suggestions you think could improve what I've done?

While monitoring the program with VisualVM, I noticed that a lot of CPU time was spent in the BlockAccessMapped.flushDirtySegments() method, increasingly so as the program progressed.

Thanks for your time,
Adam
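
P.S. In case it helps to look at the parsing step separately from the Jena calls, here is a minimal, self-contained sketch of how the first and third columns are read, following the description above. The class and method names (IdMappingLine, uniprotId, entrezIds) and the second sample row are made up for illustration and are not part of the actual program. Unlike the loader above, this sketch also treats a line with fewer than three columns as having no Entrez IDs, which is an assumption about how such lines should be handled.

```java
import java.util.ArrayList;
import java.util.List;

public class IdMappingLine {

    /** Returns the first tab-separated column (the UniProt ID). */
    static String uniprotId(String line) {
        // A limit of -1 keeps empty columns instead of dropping trailing ones.
        return line.split("\t", -1)[0];
    }

    /**
     * Returns the Entrez IDs from the third tab-separated column.
     * The list is empty if the column is blank or missing (assumed behaviour).
     */
    static List<String> entrezIds(String line) {
        String[] columns = line.split("\t", -1);
        List<String> ids = new ArrayList<>();
        if (columns.length > 2 && !columns[2].isEmpty()) {
            // Multiple values are separated by "; ", e.g. "12345; 12346".
            for (String id : columns[2].split("; ")) {
                ids.add(id.trim());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // First four columns of the sample row, with explicit tabs.
        String single = "Q6GZX4\t001R_FRG3G\t2947773\tYP_031579.1";
        // Hypothetical row with two Entrez IDs in the third column.
        String multiple = "X00001\tEXAMPLE\t12345; 12346\tNA";
        System.out.println(uniprotId(single) + " -> " + entrezIds(single));
        System.out.println(uniprotId(multiple) + " -> " + entrezIds(multiple));
    }
}
```

Each returned Entrez ID corresponds to one call to addEntrezUniprotLinkToModel in the loader above.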