On Tue, Jul 26, 2011 at 3:51 PM, Simon Kornblith <[email protected]> wrote:
> On Jul 26, 2011, at 3:13 PM, Bruce D'Arcus wrote:
>
>> On Tue, Jul 26, 2011 at 2:36 PM, Simon Kornblith <[email protected]> wrote:
>>
>>> So, I have a crazy idea of how to shift as much of the complexity of
>>> generating CSL away from the user as possible. Essentially, I want to be
>>> able to copy and paste bibliography entries from a journal's reference list
>>> into a box and end up with a formatted style.
>>
>> Indeed, this would probably be the ideal (except that, note: most of
>> the time, the examples aren't extensive enough to account for what
>> authors often need; code should account for that if it can).
>
> That's the rationale behind using existing macros when they fit, instead of
> trying to infer everything, but there may still be some issues with this.
>
>>> As far as the implementation goes, we would need to:
>>> 1) Convert the bibliography entries to a series of labeled fields using a
>>> parser such as FreeCite.
>>> 2) Where possible, string together macros from existing styles to generate
>>> the output.
>>> 3) If the output contains a substring that cannot be generated using
>>> existing macros, generate a new macro to generate only that substring and
>>> use existing macros for the rest. In order to avoid generating macros that
>>> work for only a limited set of references (e.g., "(" as a prefix on one
>>> element and ")" as a suffix on a different element), this would need to be
>>> done either using a statistical model based on the distribution of prefixes,
>>> suffixes, and group delimiters in the CSL repository and choosing the most
>>> likely macro, or by using a set of heuristics.
>>> As far as (3) goes, I made a naive implementation of the former in
>>> Scheme/MIT Church (https://github.com/simonster/csl-inference) that mostly
>>> works. MIT Church is really nice in some ways, but the inference is
>>> imperfect (samples are not actually independent). Heuristics would
>>> undoubtedly be faster, and might work better.
>>
>> Why MIT Church, and not, say, Python? Just something you'd been
>> playing with, or is there some other reason?
>
> MIT Church has a lot of rough edges, but it makes performing this kind of
> inference very simple. Essentially, you can write code to generate a random
> sample from some distribution (a generative model), and it will find samples
> that match a given set of parameters, even when drawing a sample with those
> parameters by chance is highly improbable. That code contains a routine to
> generate a random CSL substring from a distribution defined by the prefixes,
> suffixes, and group delimiters in the CSL repository, which is very large.
> Church's mh-query function takes that function and samples that very large
> distribution of substrings for a CSL substring that matches the given output.
> Since the CSL generating routine is more likely to give samples that more
> closely resemble the repository, CSL substrings are more likely to resemble
> those in the repository than not. Church is intended to make writing code to
> perform this kind of inference very easy.
>
> Unfortunately, Church is very computationally intensive, and the algorithm it
> uses for inference (Metropolis-Hastings) might be suboptimal for this kind of
> problem judging by the results, so I'm not sure this code has much of a
> future besides as a proof of concept.
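
For anyone who wants to see the shape of the inference described above without
reading Scheme, here is a minimal Python sketch. It is not the csl-inference
code and does not use Church's mh-query. It is a hand-rolled
Metropolis-Hastings loop over three slots (group delimiter, prefix, suffix),
and it assumes a toy "author + year" rendering. The counts in PRIOR_COUNTS are
made up for illustration and only stand in for statistics that would be
gathered from the CSL repository. All function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical counts; real values would come from the CSL repository.
PRIOR_COUNTS = {
    "delimiter": {" ": 50, ", ": 40, ". ": 30, " (": 1},
    "prefix": {"": 60, "(": 30, "[": 5},
    "suffix": {"": 60, ")": 30, "]": 5, ".": 20},
}
SLOTS = ("delimiter", "prefix", "suffix")


def sample_prior(slot, rng):
    """Draw one value for a slot in proportion to its (assumed) frequency."""
    values, weights = zip(*PRIOR_COUNTS[slot].items())
    return rng.choices(values, weights=weights, k=1)[0]


def render(state, author, year):
    """Toy generative model: author, delimiter, then the affixed year."""
    return author + state["delimiter"] + state["prefix"] + year + state["suffix"]


def likelihood(state, author, year, observed, noise=0.05):
    """1.0 for an exact match; a small 'noise' value otherwise so the chain can move."""
    return 1.0 if render(state, author, year) == observed else noise


def mh_infer(author, year, observed, steps=50000, seed=0):
    """Metropolis-Hastings over the slots; returns a tally of matching states."""
    rng = random.Random(seed)
    state = {slot: sample_prior(slot, rng) for slot in SLOTS}
    current = likelihood(state, author, year, observed)
    tally = Counter()
    for _ in range(steps):
        # Propose by redrawing one slot from its prior. With this proposal the
        # prior terms cancel, so the acceptance ratio is the likelihood ratio.
        slot = rng.choice(SLOTS)
        proposal = dict(state)
        proposal[slot] = sample_prior(slot, rng)
        proposed = likelihood(proposal, author, year, observed)
        if rng.random() < proposed / current:
            state, current = proposal, proposed
        if current == 1.0:  # only count states that reproduce the observed text
            tally[tuple(state[s] for s in SLOTS)] += 1
    return tally


if __name__ == "__main__":
    for combo, count in mh_infer("Smith", "2011", "Smith (2011)").most_common():
        print(combo, count)
```

Two slot assignments reproduce "Smith (2011)" here: delimiter " " with prefix
"(" and suffix ")", or delimiter " (" with an empty prefix and suffix ")". The
second is the kind of split-parenthesis macro mentioned in point 3. Under the
assumed counts, the chain should spend far more time in the first, because
" (" is rare as a delimiter. Successive samples are correlated, which is the
non-independence noted in the quoted message.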
Do you have some thoughts on a possibly more appropriate algorithm,
should someone want to explore alternatives?

[...snip...]

Bruce

_______________________________________________
xbiblio-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
