I'm building a system for processing large RDF data sets with Hadoop. https://github.com/paulhoule/infovore/wiki
The first stages are written in Java and perform the function of normalizing, validating, and cleaning up the data. The stage after that will subdivide Freebase into several major "horizontal" subdivisions that users may or may not want. For instance, Freebase uses two different vocabularies for expressing external keys; each represents 100+ million facts, so it's desirable to pick the one you like and throw the other in the bit bucket. That phase will probably be written in Java, but to do the research to figure out how to partition the data, I want to run ad-hoc queries with Pig.

The first thing I'm working on is an input UDF for reading N-Triples files. Rather than deeply parsing the nodes, I'm splitting each triple into three Texts. This isn't too different from reading a whitespace-separated file, but it's a little more complicated because the object field can itself contain spaces, and you need to trim off the terminating period and possibly some whitespace at the end.

Now, it turns out that my UDF depends on classes spread across three different Maven projects (the PrimitiveTriple parser has been around for a while), so I need to REGISTER multiple jar files. I also make heavy use of Guava and other third-party libraries, so the list of things I need to REGISTER is pretty long. What I'm trying now is to run this program

https://github.com/paulhoule/infovore/blob/master/chopper/src/main/java/com/ontology2/chopper/tools/GenerateRegisterStatements.java

piping it like so:

    mvn dependency:build-classpath | mvn exec:java -Dexec.mainClass=com.ontology2.chopper.tools.GenerateRegisterStatements

This could be integrated into the Maven build process in the future. Anyway, is there a better way to do this?
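For context, the splitting logic I described (three fields, spaces allowed only in the object, terminating period trimmed) can be sketched roughly like this. This is a minimal standalone sketch, not the actual UDF; the class and method names are made up for illustration:

```java
// Hypothetical sketch: split one N-Triples line into subject, predicate, object.
public class TripleSplitter {
    // Returns {subject, predicate, object}, or null if the line is malformed.
    public static String[] split(String line) {
        String t = line.trim();
        // Trim the terminating period and any whitespace before it.
        if (t.endsWith(".")) {
            t = t.substring(0, t.length() - 1).trim();
        }
        int firstSpace = t.indexOf(' ');
        if (firstSpace < 0) return null;
        int secondSpace = t.indexOf(' ', firstSpace + 1);
        if (secondSpace < 0) return null;
        // Everything after the second space is the object -- it may itself
        // contain spaces, e.g. a quoted literal like "New York"@en.
        return new String[] {
            t.substring(0, firstSpace),
            t.substring(firstSpace + 1, secondSpace),
            t.substring(secondSpace + 1)
        };
    }
}
```

The key design point is that only the first two spaces are treated as field separators, so literals with embedded spaces survive intact.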
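To make the question concrete, the core of such a generator might look like the sketch below. This is not the actual GenerateRegisterStatements from the linked repo; the class name RegisterGen is hypothetical, and it assumes the classpath arrives on stdin as a File.pathSeparator-delimited line (as dependency:build-classpath would print it):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical sketch: turn a classpath string into Pig REGISTER statements.
public class RegisterGen {
    public static String toRegisterStatements(String classpath) {
        StringBuilder sb = new StringBuilder();
        for (String entry : classpath.split(File.pathSeparator)) {
            // Only jar paths become REGISTER statements; Maven log lines
            // and other noise are skipped because they don't end in .jar.
            if (entry.endsWith(".jar")) {
                sb.append("REGISTER ").append(entry).append(";\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(toRegisterStatements(line));
        }
    }
}
```

The output can then be pasted into (or imported by) a Pig script ahead of the queries that use the UDF.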
