Sebastian, great you found time for it! I didn't :/ (Stats are worth a tweet, IMHO :)
Egon

On Fri, Sep 23, 2016 at 12:20 PM, Sebastian Burgstaller <sebastian.burgstal...@gmail.com> wrote:

> Hi Denny,
>
> Sorry, I missed this email. I just did the calculation for InChI string
> lengths on the 92 million PubChem compounds:
>
>   99%    99.9%   100%
>   311    676     4502
>
> That said, there is no upper limit for the length, but 4502 is the
> longest string in the PubChem database. The other IDs, canonical and
> isomeric SMILES, have the same distribution shape, but are overall
> slightly shorter.
>
> Best,
> Sebastian
>
> On Sun, Sep 18, 2016 at 9:19 PM, Denny Vrandečić <vrande...@gmail.com>
> wrote:
> > Can you figure out what a good limit would be for these two use cases?
> > I.e. what would support 99%, 99.9%, and 100%?
> >
> > On Sun, Sep 18, 2016, 12:27 Egon Willighagen <egon.willigha...@gmail.com>
> > wrote:
> >>
> >> Hi all,
> >>
> >> sorry for joining the party late...
> >>
> >> On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller
> >> <sebastian.burgstal...@gmail.com> wrote:
> >> > I think this topic might have been discussed many months ago. For
> >> > certain data types in the chemical compound space (P233 canonical
> >> > SMILES, P2017 isomeric SMILES and P234 InChI) a higher character
> >> > limit than 400 would be really helpful (1500 to 2000 chars; I sense
> >> > that this might cause problems with SPARQL). Are there any plans on
> >> > implementing this? In general, for quality assurance, many string
> >> > property types would profit from a fixed max string length.
> >>
> >> 400 characters is not a lot for chemicals... InChIs can be a lot
> >> larger indeed. 2k would allow us to capture a lot more chemicals. BTW,
> >> this also applies to the canonical SMILES, which also doesn't have an
> >> upper bound. Tannic acid (Q427956) is an example (which, looking at the
> >> InChIKey, came up when running the bot :) From working with ChEMBL as
> >> RDF I know it has InChIs of length > 1024, which was the max length in
> >> Virtuoso... I think it's important for biology and chemistry to
> >> increase the limit.
> >>
> >> Egon
> >>
> >> --
> >> E.L. Willighagen
> >> Department of Bioinformatics - BiGCaT
> >> Maastricht University (http://www.bigcat.unimaas.nl/)
> >> Homepage: http://egonw.github.com/
> >> LinkedIn: http://se.linkedin.com/in/egonw
> >> Blog: http://chem-bla-ics.blogspot.com/
> >> PubList: http://www.citeulike.org/user/egonw/tag/papers
> >> ORCID: 0000-0001-7542-0286
> >> ImpactStory: https://impactstory.org/EgonWillighagen
> >>
> >> _______________________________________________
> >> Wikidata mailing list
> >> Wikidata@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
E.L. Willighagen
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286
ImpactStory: https://impactstory.org/u/egonwillighagen
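
For anyone wanting to redo the 99% / 99.9% / 100% figures quoted in this thread on another dump, here is a minimal sketch. It assumes the identifier strings are available as a plain Python iterable. The function name `length_percentiles`, the per-mille thresholds and the nearest-rank method are illustrative assumptions, not necessarily how the PubChem numbers above were computed.

```python
def length_percentiles(strings, per_mille=(990, 999, 1000)):
    """Return {threshold: length} for string lengths, nearest-rank method.

    Thresholds are given in per mille (990 = 99%, 999 = 99.9%, 1000 = 100%)
    so that the rank arithmetic stays in integers.
    """
    lengths = sorted(len(s) for s in strings)
    if not lengths:
        raise ValueError("no strings given")
    n = len(lengths)
    result = {}
    for pm in per_mille:
        # Nearest rank: ceil(pm / 1000 * n), computed with integer division.
        rank = max(1, (pm * n + 999) // 1000)
        # The value at this rank is a limit that fits at least pm/1000
        # of all strings.
        result[pm] = lengths[rank - 1]
    return result


if __name__ == "__main__":
    # Toy input: 1000 strings with lengths 1..1000 (hypothetical data).
    sample = ["x" * i for i in range(1, 1001)]
    print(length_percentiles(sample))
```

The value returned for 1000 is simply the longest string seen, which corresponds to the "100%" column in the table above.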