OK, thanks! BTW, I can confirm that the NPE is solved by adding that json-YYYYMMDD/ subdir...
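For anyone hitting the same issue, the workaround can be sketched in a few lines. This is a hypothetical helper (the class and method names are mine, not WDTK API); it only reproduces the json-YYYYMMDD/ local-cache layout mentioned above, under which the toolkit will detect a manually downloaded dump:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds the local-cache path that WDTK scans,
// i.e. ./dumpfiles/wikidatawiki/json-<date>/<date>.json.gz
public class PrepareDumpDir {

    static Path localJsonDumpPath(String baseDir, String dateStamp) {
        return Paths.get(baseDir, "dumpfiles", "wikidatawiki",
                "json-" + dateStamp, dateStamp + ".json.gz");
    }

    public static void main(String[] args) throws IOException {
        Path target = localJsonDumpPath(".", "20150112");
        // Create ./dumpfiles/wikidatawiki/json-20150112/ if it is missing.
        Files.createDirectories(target.getParent());
        // A manually downloaded 20150112.json.gz would then be moved to `target`.
        System.out.println(target);
    }
}
```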
Another question: is it possible to cancel the parsing of a data dump file programmatically? I saw the timeout, but I am integrating this in a GUI where the user may push a cancel button, and it would be nice if I could propagate that and stop the actual processing...

Egon

On Sun, Jan 18, 2015 at 3:23 PM, Markus Krötzsch <[email protected]> wrote:
> The issue was fixed in master now. I also added some more INFO-type messages
> that will report about the dump files found online and locally.
>
> Cheers,
>
> Markus
>
> On 18.01.2015 14:26, Markus Krötzsch wrote:
>> On 18.01.2015 10:58, Egon Willighagen wrote:
>>> On Sat, Jan 17, 2015 at 11:04 PM, Markus Krötzsch
>>> <[email protected]> wrote:
>>>> It is easy to fix this (though I will not fix it tonight, but
>>>> tomorrow) by just adjusting the HTML strings we parse for.
>>>
>>> Sure! I have subscribed to the bug report.
>>>
>>> As an intermediate workaround for me, what file name pattern is used
>>> in the local cache?
>>>
>>> I had manually downloaded a file (and made it available as a torrent,
>>> because it downloaded at only about 1 MB/s, [0]) and put it in the
>>> folder, but it was not recognized... the file on the server is:
>>> http://dumps.wikimedia.org/other/wikidata/20150112.json.gz
>>>
>>> But as 20150112.json.gz it is not detected... I noted the json-*
>>> pattern in the code, but json-20150112.json.gz didn't work either...
>>
>> The dump files are put into subdirectories of the current directory
>> ("."), for example:
>>
>> ./dumpfiles/wikidatawiki/json-20150105/20150105.json.gz
>> (JSON dump)
>>
>> ./dumpfiles/wikidatawiki/current-20141009/wikidatawiki-20141009-pages-meta-current.xml.bz2
>> (current-revision XML dump)
>>
>> If you create a directory of this form and put a file in there with the
>> file name as found online, then the tool will find it.
>>
>>> BTW, a second question: is there a way to list all local (JSON) dumps
>>> using the WDTK API?
>>
>> Yes, though it's not very convenient right now. To restrict to local
>> files, you can use the DumpProcessingController in offline mode (then
>> it only looks at local files):
>>
>>   DumpProcessingController dumpProcessingController =
>>       new DumpProcessingController("wikidatawiki");
>>   dumpProcessingController.setOfflineMode(true);
>>
>>   List<MwDumpFile> localJsonDumps =
>>       dumpProcessingController
>>           .getWmfDumpFileManager()
>>           .findAllDumps(DumpContentType.JSON);
>>
>> This gives you a list of MwDumpFile objects that you can access to get
>> their date (getDateStamp()) and also to access the file contents.
>>
>> I think we should log some additional messages about the files that
>> are found and used.
>>
>> Cheers,
>>
>> Markus
>>
>>>> We should also improve our error reporting for this case, obviously.
>>>
>>> Yeah, that's an art that no software I have ever worked with has
>>> mastered... it's hard! But it's important... I was completely looking
>>> in the wrong place... mind you, monitoring logging messages can be
>>> hard too, when WDTK is used in other environments, such as Bioclipse,
>>> and you cannot rely on those messages to show up :(
>>>
>>> Thanks for immediately looking into it, and looking forward to
>>> pointers for my two questions.
>>>
>>> Greetings,
>>>
>>> Egon
>
> _______________________________________________
> Wikidata-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l

-- 
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286
ImpactStory: https://impactstory.org/EgonWillighagen
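On the cancellation question at the top of the thread: nothing in this thread confirms a dedicated cancel hook in WDTK, but a generic cooperative-cancellation pattern works with any per-record processing loop or callback. The sketch below uses hypothetical names (CancellableWork, processRecords) and is not WDTK API; in a real integration the flag check would go inside the document-processing callback:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Generic cooperative-cancellation sketch (hypothetical names, not WDTK API).
public class CancellableWork {

    // Thread-safe flag so the GUI thread can signal the worker thread.
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // Called from the GUI's cancel button.
    public void cancel() {
        cancelled.set(true);
    }

    // Simulated per-record loop; in a real callback one could instead
    // throw a runtime exception here and catch it around the dump run.
    public int processRecords(int totalRecords) {
        int processed = 0;
        for (int i = 0; i < totalRecords; i++) {
            if (cancelled.get()) {
                break; // stop as soon as cancellation was requested
            }
            processed++; // stand-in for handling one record
        }
        return processed;
    }

    public static void main(String[] args) {
        CancellableWork work = new CancellableWork();
        work.cancel(); // simulate pressing "cancel" before processing starts
        System.out.println(work.processRecords(1000)); // prints 0
    }
}
```

The AtomicBoolean makes the signal safe across threads; polling it once per record keeps the cancel latency to at most one record's processing time.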
