Md. Mizanur Rahoman wrote:
> Hi Paolo,
>
> Thanks for your reply.
>
> Right now I am only using DBpedia, GeoNames and NYTimes from the LOD
> cloud, and later on I want to extend my dataset.

Ok, so it's big, but not huge! ;-) If you have enough RAM you can do
everything on a single machine.

> By the way, yes, I can use SPARQL directly to collect my required
> statistics, but my assumption is that using Hadoop could give me a
> boost in collecting those statistics.

Well, it all depends on whether you already have a Hadoop cluster you
can use. If not, a single machine with a lot of RAM might be
easier/faster/better.
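If you do stay on one machine, each statistic you listed is a single
aggregate query, so you may not need Hadoop at all. As an untested
sketch (I'm assuming Virtuoso's default SPARQL endpoint at
http://localhost:8890/sparql and Jena's com.hp.hpl.jena packages; the
class name is made up), counting distinct subjects with Jena looks
like this:

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSet;

public class CountSubjects {
    public static void main(String[] args) {
        // Assumed endpoint; adjust host/port to your Virtuoso setup.
        String endpoint = "http://localhost:8890/sparql";
        // Swap ?s for ?p or ?o to count the other two positions.
        String query = "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }";
        QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getLiteral("n").getLong());
            }
        } finally {
            qe.close(); // always release the HTTP connection
        }
    }
}

Be aware that COUNT(DISTINCT ...) over hundreds of millions of triples
can be slow on any store; that's exactly where the cluster-vs-big-machine
question starts to matter.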
> I will ping you after going through your links.

Sure, let me know how it goes.

Paolo

> -
> Sincerely
> Md Mizanur
>
>
> On Tue, Jun 26, 2012 at 12:50 AM, Paolo Castagna <
> [email protected]> wrote:
>
>> Hi Mizanur,
>> when you have big RDF datasets, it might make sense to use MapReduce
>> (but only if you already have a Hadoop cluster at hand. Is this your
>> case?).
>> You say that your data is 'huge', just for the sake of curiosity...
>> how many triples/quads is 'huge'? ;-)
>> Most of the use cases I've seen related to statistics on RDF datasets
>> were trivial MapReduce jobs.
>>
>> For a couple of examples of using MapReduce with RDF datasets, have a
>> look here:
>> https://github.com/castagna/jena-grande
>> https://github.com/castagna/tdbloader4
>>
>> This, for example, is certainly not exactly what you need, but I am
>> sure that with little changes you can get what you want:
>> https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java
>>
>> Last but not least, you'll need to dump your RDF data out onto HDFS.
>> I suggest you use the N-Triples/N-Quads serialization formats.
>>
>> Running SPARQL queries on top of a Hadoop cluster is another (long
>> and not easy) story.
>> But it might be possible to translate part of the SPARQL algebra into
>> Pig Latin scripts and use Pig.
>> In my opinion, however, it makes more sense to use MapReduce to
>> filter/slice massive datasets, load the result into a triple store
>> and refine your data analysis using SPARQL there.
>>
>> My 2 cents,
>> Paolo
>>
>> Md. Mizanur Rahoman wrote:
>>> Dear All,
>>>
>>> I want to collect some statistics over RDF data. My triple store is
>>> Virtuoso and I am using Jena for executing my queries. I want to get
>>> some statistics like:
>>> i) how many resources are in my dataset; ii) in which positions of
>>> the dataset (i.e., sub/prd/obj) resources appear; etc. As my data is
>>> huge, I want to use Hadoop MapReduce to calculate such statistics.
>>>
>>> Can you please suggest?
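P.S. To make the "trivial MapReduce jobs" point above concrete, here
is the shape of a job that counts how often each RDF term occurs in
the subject, predicate and object positions. It's an untested sketch
of mine (much simpler than StatsDriver): it assumes canonical
N-Triples already on HDFS, one triple per line with single-space
separators, and the class/path names are made up.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PositionCounts {

    // Emits ("S <term>", 1), ("P <term>", 1) and ("O <term>", 1) per triple.
    public static class PositionMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty() || line.startsWith("#")) return;
            // Naive split: fine for canonical N-Triples, but literals
            // containing " ." would break it; use Jena's RIOT parser
            // for real data.
            int s1 = line.indexOf(' ');
            int s2 = line.indexOf(' ', s1 + 1);
            int end = line.lastIndexOf(" .");
            if (s1 < 0 || s2 < 0 || end <= s2) return; // skip malformed lines
            context.write(new Text("S " + line.substring(0, s1)), ONE);
            context.write(new Text("P " + line.substring(s1 + 1, s2)), ONE);
            context.write(new Text("O " + line.substring(s2 + 1, end).trim()), ONE);
        }
    }

    // Sums the occurrences of each (position, term) pair.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rdf-position-counts");
        job.setJarByClass(PositionCounts.class);
        job.setMapperClass(PositionMapper.class);
        job.setCombinerClass(SumReducer.class); // safe: sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Copy your dump up with "hadoop fs -put data.nt input/" and run it with
"hadoop jar stats.jar PositionCounts input output". The number of
distinct output keys starting with "S " is then your number of
distinct subjects, and so on; a second tiny job (or a grep over the
output) gives you those totals.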
