Dear all,

Finally, I was able to execute the 2 million queries in 24 minutes. The solution was to load Virtuoso fully in memory: I created a ramdisk filesystem and copied the full Virtuoso installation there. The copy took only a few minutes. I know the Wikidata data is then stored in volatile memory, but I also synchronised that folder with a folder on the SSD disk for safety.

I think this approach could also be used to load Wikidata even faster, by running Virtuoso from a ramdisk-based directory during loading. Once the loading is done, the folder can be moved back from the ramdisk directory to a hard-disk directory for data persistence.

Thanks for all your suggestions and ideas. They saved me time because I was able to narrow down the set of possible solutions, between software and hardware.
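For reference, a minimal sketch of the ramdisk setup (the mount point, Virtuoso path, and sizes are illustrative assumptions, not my exact commands; see the linked tutorial below):

```shell
# Create a tmpfs ramdisk large enough to hold the full Virtuoso
# database directory (size is illustrative). Requires root.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=400G tmpfs /mnt/ramdisk

# Copy the Virtuoso installation onto the ramdisk (source path is an
# assumption).
rsync -a /opt/virtuoso/ /mnt/ramdisk/virtuoso/

# ... run Virtuoso from /mnt/ramdisk/virtuoso and execute the queries ...

# Periodically synchronise back to the SSD, since tmpfs contents are
# lost on reboot or unmount.
rsync -a /mnt/ramdisk/virtuoso/ /ssd/virtuoso-backup/
```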
Best,
Adam

Links:
https://www.linuxbabe.com/command-line/create-ramdisk-linux

On Thu, 23 Jul 2020 at 09:01, Aidan Hogan <[email protected]> wrote:
>
> Hi Adam,
>
> On 2020-07-13 13:41, Adam Sanchez wrote:
> > Hi,
> >
> > I have to launch 2 million queries against a Wikidata instance.
> > I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> > with RAID 0).
> > The queries are simple, just 2 types.
> >
> > select ?s ?p ?o {
> >   ?s ?p ?o .
> >   filter (?s = ?param)
> > }
> >
> > select ?s ?p ?o {
> >   ?s ?p ?o .
> >   filter (?o = ?param)
> > }
> >
> > If I use a Java ThreadPoolExecutor, it takes 6 hours.
> > How can I speed up the query processing even more?
>
> Perhaps I am a bit late to respond.
>
> It's not really clear to me what you are aiming for, but if this is a
> once-off task, I would recommend to download the dump in Turtle or
> N-Triples, load your two million parameters in memory in a sorted or
> hashed data structure in the programming language of your choice (it
> should take considerably less than 1 GB of memory assuming typical
> constants), use a streaming RDF parser for that language, and for each
> subject/object, check if it's in your list in memory. This solution is
> about as good as you can get in terms of once-off batch processing.
>
> If your idea is to index the data so you can do 2 million lookups in
> "interactive time", your problem is not what software to use, it's what
> hardware to use.
>
> Traditional hard disks have a physical arm that takes maybe 5-10 ms to
> move. Solid state disks are quite a bit better, but still have seek
> times in the range of 0.1 ms. Multiply those seek times by 2 million and
> you have a long wait (caching will help, as will multiple disks, but not
> by nearly enough).
> You would need to get the data into main memory (RAM) to have any
> chance of approximating interactive times, and even then you will
> probably not get interactive runtimes without leveraging some further
> assumptions about what you want to do in order to optimise further
> (e.g., if you're only interested in Q-ids, you can use integers or bit
> vectors, etc.). In the most general case, you would probably need to
> pre-filter the data as much as you can, and also use as much
> compression as you can (ideally with compact data structures) to get
> the data into memory on one machine, or you might think about something
> like Redis (an in-memory key-value store) on lots of machines.
> Essentially, if your goal is interactive times on millions of lookups,
> you very likely need to look at options purely in RAM (unless you have
> thousands of disks available, at least). The good news is that
> 512 GB(?) sounds like a lot of space to store stuff in.
>
> Best,
> Aidan
>
> > I was thinking:
> >
> > a) to implement a Virtuoso cluster to distribute the queries, or
> > b) to load Wikidata in a Spark dataframe (since the Sansa framework
> > is very slow, I would use my own implementation), or
> > c) to load Wikidata in a PostgreSQL table and use Presto to
> > distribute the queries, or
> > d) to load Wikidata in a PG-Strom table to use GPU parallelism.
> >
> > What do you think? I am looking for ideas.
> > Any suggestion will be appreciated.
> >
> > Best,
> >
> > _______________________________________________
> > Wikidata mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
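[Editorial note: Aidan's once-off batch suggestion above (two million parameters in a hashed in-memory structure, checked against a streamed dump) could be sketched roughly as follows. This is a deliberately minimal Python illustration, not a full RDF parser: it assumes one triple per line in N-Triples form, and the IRIs and data are made up for the example.]

```python
import re

# Minimal N-Triples line matcher: subject IRI, predicate IRI, then the
# object term, terminated by " .". Real dumps also contain blank nodes
# and multi-part literals, which a proper streaming RDF parser handles.
TRIPLE = re.compile(r'(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def stream_filter(lines, params):
    """Yield triples whose subject or object is in the params set.

    `params` plays the role of the two million query parameters, held
    in a hashed in-memory structure so each check is O(1).
    """
    for line in lines:
        m = TRIPLE.match(line)
        if not m:
            continue  # skip comments, blank lines, unparsed forms
        s, p, o = m.groups()
        if s in params or o in params:
            yield (s, p, o)

# Illustrative data; in practice `lines` would be a streamed dump file.
params = {"<http://www.wikidata.org/entity/Q42>"}
data = [
    '<http://www.wikidata.org/entity/Q42> <http://schema.org/name> "Douglas Adams" .',
    '<http://www.wikidata.org/entity/Q1> <http://schema.org/name> "universe" .',
]
matches = list(stream_filter(data, params))
```

One pass over the dump answers all two million lookups at once, which is why this beats per-query index seeks for a once-off batch job.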
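[Editorial note: the "integers or bit vectors" idea for Q-ids mentioned above could look something like the following sketch. The entity prefix is the standard Wikidata namespace; the upper bound on Q-ids and the helper names are illustrative assumptions.]

```python
PREFIX = "http://www.wikidata.org/entity/Q"

def qid_to_int(iri):
    """Map a Wikidata entity IRI to its numeric Q-id, or None if the
    IRI is not in the Q-entity namespace."""
    suffix = iri[len(PREFIX):] if iri.startswith(PREFIX) else ""
    return int(suffix) if suffix.isdigit() else None

class BitSet:
    """Minimal bit vector: one bit per possible Q-id, so membership
    tests are O(1) and n ids cost n/8 bytes rather than full IRI
    strings in a hash set."""
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def add(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

# An assumed upper bound of ~110 million Q-ids fits in roughly 14 MB,
# comfortably in RAM even alongside the data itself.
wanted = BitSet(110_000_000)
wanted.add(qid_to_int("http://www.wikidata.org/entity/Q42"))
```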
