Dear all,

Finally, I was able to execute the 2 million queries in 24 minutes. The solution was to load Virtuoso fully in memory: I created a ramdisk filesystem and copied the full Virtuoso installation there. The copy took only a few minutes. I know the Wikidata data is then stored in volatile memory, but I also synchronised that folder with a folder on the SSD disk for safety.

I think this approach could also be used to load Wikidata even faster, by running Virtuoso from a ramdisk-based directory during loading. Once the loading is done, the folder can be moved back from the ramdisk directory to a hard-disk directory for data persistence.

Thanks for all your suggestions and ideas. They saved me time because I was able to narrow down the set of possible solutions, between software and hardware.
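For reference, a minimal sketch of the ramdisk setup (the mount point, Virtuoso path, and sizes are illustrative assumptions, not my exact commands; see the linked tutorial below):

```shell
# Create a tmpfs ramdisk large enough to hold the full Virtuoso
# database directory (size is illustrative). Requires root.
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=400G tmpfs /mnt/ramdisk

# Copy the Virtuoso installation onto the ramdisk (source path is an
# assumption).
rsync -a /opt/virtuoso/ /mnt/ramdisk/virtuoso/

# ... run Virtuoso from /mnt/ramdisk/virtuoso and execute the queries ...

# Periodically synchronise back to the SSD, since tmpfs contents are
# lost on reboot or unmount.
rsync -a /mnt/ramdisk/virtuoso/ /ssd/virtuoso-backup/
```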
Best,
Adam

Links:
https://www.linuxbabe.com/command-line/create-ramdisk-linux

On Thu, 23 Jul 2020 at 09:01, Aidan Hogan <[email protected]> wrote:
>
> Hi Adam,
>
> On 2020-07-13 13:41, Adam Sanchez wrote:
> > Hi,
> >
> > I have to launch 2 million queries against a Wikidata instance.
> > I have loaded Wikidata in Virtuoso 7 (512 GB RAM, 32 cores, SSD disks
> > with RAID 0).
> > The queries are simple, just 2 types.
> >
> > select ?s ?p ?o {
> >   ?s ?p ?o .
> >   filter (?s = ?param)
> > }
> >
> > select ?s ?p ?o {
> >   ?s ?p ?o .
> >   filter (?o = ?param)
> > }
> >
> > If I use a Java ThreadPoolExecutor, it takes 6 hours.
> > How can I speed up the query processing even more?
>
> Perhaps I am a bit late to respond.
>
> It's not really clear to me what you are aiming for, but if this is a
> once-off task, I would recommend to download the dump in Turtle or
> N-Triples, load your two million parameters in memory in a sorted or
> hashed data structure in the programming language of your choice (it
> should take considerably less than 1 GB of memory assuming typical
> constants), use a streaming RDF parser for that language, and for each
> subject/object, check if it's in your list in memory. This solution is
> about as good as you can get in terms of once-off batch processing.
>
> If your idea is to index the data so you can do 2 million lookups in
> "interactive time", your problem is not what software to use, it's what
> hardware to use.
>
> Traditional hard disks have a physical arm that takes maybe 5-10 ms to
> move. Solid state disks are quite a bit better, but still have seek
> times in the range of 0.1 ms. Multiply those seek times by 2 million and
> you have a long wait (caching will help, as will multiple disks, but not
> by nearly enough).
> You would need to get the data into main memory (RAM) to have any
> chance of approximating interactive times, and even then you will
> probably not get interactive runtimes without leveraging some further
> assumptions about what you want to do in order to optimise further
> (e.g., if you're only interested in Q-ids, you can use integers or bit
> vectors, etc.). In the most general case, you would probably need to
> pre-filter the data as much as you can, and also use as much
> compression as you can (ideally with compact data structures) to get
> the data into memory on one machine, or you might think about something
> like Redis (an in-memory key-value store) on lots of machines.
> Essentially, if your goal is interactive times on millions of lookups,
> you very likely need to look at options purely in RAM (unless you have
> thousands of disks available, at least). The good news is that
> 512 GB(?) sounds like a lot of space to store stuff in.
>
> Best,
> Aidan
>
> > I was thinking:
> >
> > a) to implement a Virtuoso cluster to distribute the queries, or
> > b) to load Wikidata in a Spark dataframe (since the Sansa framework
> > is very slow, I would use my own implementation), or
> > c) to load Wikidata in a PostgreSQL table and use Presto to
> > distribute the queries, or
> > d) to load Wikidata in a PG-Strom table to use GPU parallelism.
> >
> > What do you think? I am looking for ideas.
> > Any suggestion will be appreciated.
> >
> > Best,
> >
> > _______________________________________________
> > Wikidata mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikidata

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
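[Editorial note: Aidan's once-off batch suggestion above (two million parameters in a hashed in-memory structure, checked against a streamed dump) could be sketched roughly as follows. This is a deliberately minimal Python illustration, not a full RDF parser: it assumes one triple per line in N-Triples form, and the IRIs and data are made up for the example.]

```python
import re

# Minimal N-Triples line matcher: subject IRI, predicate IRI, then the
# object term, terminated by " .". Real dumps also contain blank nodes
# and multi-part literals, which a proper streaming RDF parser handles.
TRIPLE = re.compile(r'(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def stream_filter(lines, params):
    """Yield triples whose subject or object is in the params set.

    `params` plays the role of the two million query parameters, held
    in a hashed in-memory structure so each check is O(1).
    """
    for line in lines:
        m = TRIPLE.match(line)
        if not m:
            continue  # skip comments, blank lines, unparsed forms
        s, p, o = m.groups()
        if s in params or o in params:
            yield (s, p, o)

# Illustrative data; in practice `lines` would be a streamed dump file.
params = {"<http://www.wikidata.org/entity/Q42>"}
data = [
    '<http://www.wikidata.org/entity/Q42> <http://schema.org/name> "Douglas Adams" .',
    '<http://www.wikidata.org/entity/Q1> <http://schema.org/name> "universe" .',
]
matches = list(stream_filter(data, params))
```

One pass over the dump answers all two million lookups at once, which is why this beats per-query index seeks for a once-off batch job.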
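[Editorial note: the "integers or bit vectors" idea for Q-ids mentioned above could look something like the following sketch. The entity prefix is the standard Wikidata namespace; the upper bound on Q-ids and the helper names are illustrative assumptions.]

```python
PREFIX = "http://www.wikidata.org/entity/Q"

def qid_to_int(iri):
    """Map a Wikidata entity IRI to its numeric Q-id, or None if the
    IRI is not in the Q-entity namespace."""
    suffix = iri[len(PREFIX):] if iri.startswith(PREFIX) else ""
    return int(suffix) if suffix.isdigit() else None

class BitSet:
    """Minimal bit vector: one bit per possible Q-id, so membership
    tests are O(1) and n ids cost n/8 bytes rather than full IRI
    strings in a hash set."""
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def add(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

# An assumed upper bound of ~110 million Q-ids fits in roughly 14 MB,
# comfortably in RAM even alongside the data itself.
wanted = BitSet(110_000_000)
wanted.add(qid_to_int("http://www.wikidata.org/entity/Q42"))
```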
