Md. Mizanur Rahoman wrote:
> Hi Paolo,
> 
> Thanks for your reply.
> 
> Right now I am only using DBpedia, GeoNames and NYTimes from the LOD cloud,
> and later on I want to extend my dataset.

Ok, so it's big, but not huge! ;-)
If you have enough RAM you can do everything on a single machine.

> By the way, yes, I can use SPARQL directly to collect my required
> statistics, but my assumption is that using Hadoop could give me a boost in
> collecting those statistics.

Well, it all depends on whether you already have a Hadoop cluster you can use.
If not, a single machine with a lot of RAM might be easier/faster/better.
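
For example, if you go down the plain SPARQL route, a rough (untested) sketch
with Jena ARQ against a Virtuoso SPARQL endpoint could look like the code
below; the endpoint URL http://localhost:8890/sparql is just a placeholder,
adjust it to your setup:

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class ResourceStats {
    public static void main(String[] args) {
        // Placeholder endpoint URL: replace with your Virtuoso SPARQL endpoint.
        String endpoint = "http://localhost:8890/sparql";

        // Count distinct subjects, predicates and objects in a single query.
        String queryString =
            "SELECT (COUNT(DISTINCT ?s) AS ?subjects) " +
            "       (COUNT(DISTINCT ?p) AS ?predicates) " +
            "       (COUNT(DISTINCT ?o) AS ?objects) " +
            "WHERE { ?s ?p ?o }";

        Query query = QueryFactory.create(queryString);
        QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution soln = results.next();
                System.out.println("distinct subjects:   " + soln.getLiteral("subjects").getLong());
                System.out.println("distinct predicates: " + soln.getLiteral("predicates").getLong());
                System.out.println("distinct objects:    " + soln.getLiteral("objects").getLong());
            }
        } finally {
            qexec.close();
        }
    }
}

Whether a single query like this over the whole dataset is fast enough depends
on your Virtuoso setup, but it is the simplest thing to try first and you
avoid moving the data around at all.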

> I will get back to you after going through your links.

Sure, let me know how it goes.

Paolo

> 
> -
> Sincerely
> Md Mizanur
> 
> 
> 
> On Tue, Jun 26, 2012 at 12:50 AM, Paolo Castagna <
> [email protected]> wrote:
> 
>> Hi Mizanur,
>> when you have big RDF datasets, it might make sense to use MapReduce (but
>> only if you already have a Hadoop cluster at hand; is this your case?).
>> You say that your data is 'huge', just for the sake of curiosity... how
>> many triples/quads is 'huge'? ;-)
>> Most of the use cases I've seen related to statistics on RDF datasets were
>> trivial MapReduce jobs.
>>
>> For a couple of examples on using MapReduce with RDF datasets have a look
>> here:
>> https://github.com/castagna/jena-grande
>> https://github.com/castagna/tdbloader4
>>
>> This, for example, is certainly not exactly what you need, but I am sure
>> that with a few small changes you can get what you want:
>>
>> https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java
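>>
>> Just to give the flavour of it, here is a minimal (untested) sketch of a
>> MapReduce job which counts, for each resource, how many times it appears
>> in the subject, predicate or object position. It assumes N-Triples input,
>> one triple per line, and it splits lines naively on whitespace; a real job
>> should use a proper N-Triples parser (e.g. Jena's RIOT), since literals
>> can contain spaces:
>>
>> import java.io.IOException;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.Mapper;
>> import org.apache.hadoop.mapreduce.Reducer;
>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>
>> public class PositionStats {
>>
>>     // For each triple, emit ("<node> S", 1), ("<node> P", 1) and ("<node> O", 1).
>>     public static class PositionMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
>>         private static final LongWritable ONE = new LongWritable(1);
>>         private final Text outKey = new Text();
>>
>>         @Override
>>         protected void map(LongWritable key, Text value, Context context)
>>                 throws IOException, InterruptedException {
>>             String[] parts = value.toString().split("\\s+", 3);
>>             if (parts.length < 3) return;
>>             String object = parts[2].trim();
>>             if (object.endsWith(".")) object = object.substring(0, object.length() - 1).trim();
>>             outKey.set(parts[0] + " S"); context.write(outKey, ONE);
>>             outKey.set(parts[1] + " P"); context.write(outKey, ONE);
>>             outKey.set(object + " O"); context.write(outKey, ONE);
>>         }
>>     }
>>
>>     // Sum the counts per (node, position) pair; also usable as a combiner.
>>     public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
>>         @Override
>>         protected void reduce(Text key, Iterable<LongWritable> values, Context context)
>>                 throws IOException, InterruptedException {
>>             long sum = 0;
>>             for (LongWritable v : values) sum += v.get();
>>             context.write(key, new LongWritable(sum));
>>         }
>>     }
>>
>>     public static void main(String[] args) throws Exception {
>>         Job job = new Job(new Configuration(), "rdf-position-stats");
>>         job.setJarByClass(PositionStats.class);
>>         job.setMapperClass(PositionMapper.class);
>>         job.setCombinerClass(SumReducer.class);
>>         job.setReducerClass(SumReducer.class);
>>         job.setOutputKeyClass(Text.class);
>>         job.setOutputValueClass(LongWritable.class);
>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>     }
>> }
>>
>> From the output of a job like this you can derive figures such as "how many
>> distinct resources appear as subjects" with a second trivial job, or simply
>> with grep/sort over the part files.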
>>
>> Last but not least, you'll need to dump your RDF data out onto HDFS.
>> I suggest you use N-Triples/N-Quads serialization formats.
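>>
>> If your dumps are not N-Triples already, something along these lines can
>> convert a dump and copy it onto HDFS. Again this is only a sketch: the file
>> names are made up, and loading into a Jena Model only works if the dump
>> fits in memory (for bigger dumps use a streaming parser instead):
>>
>> import java.io.FileInputStream;
>> import java.io.FileOutputStream;
>> import java.io.InputStream;
>> import java.io.OutputStream;
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> import com.hp.hpl.jena.rdf.model.Model;
>> import com.hp.hpl.jena.rdf.model.ModelFactory;
>>
>> public class DumpToHdfs {
>>     public static void main(String[] args) throws Exception {
>>         // Convert an RDF/XML dump to N-Triples (in-memory, so small-ish dumps only).
>>         Model model = ModelFactory.createDefaultModel();
>>         InputStream in = new FileInputStream("dump.rdf");    // made-up input file name
>>         model.read(in, null, "RDF/XML");
>>         in.close();
>>         OutputStream out = new FileOutputStream("dump.nt");  // made-up output file name
>>         model.write(out, "N-TRIPLES");
>>         out.close();
>>
>>         // Copy the N-Triples file onto HDFS (equivalent to 'hadoop fs -put dump.nt /input/').
>>         FileSystem fs = FileSystem.get(new Configuration());
>>         fs.copyFromLocalFile(new Path("dump.nt"), new Path("/input/dump.nt"));
>>     }
>> }
>>
>> Many of the public dumps (DBpedia, for example) are already distributed as
>> N-Triples, so in most cases a plain 'hadoop fs -put' is all you need.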
>>
>> Running SPARQL queries on top of a Hadoop cluster is another (long and
>> not easy) story.
>> But it might be possible to translate part of the SPARQL algebra into Pig
>> Latin scripts and use Pig.
>> In my opinion however, it makes more sense to use MapReduce to
>> filter/slice massive datasets, load the result into a triple store and
>> refine your data analysis using SPARQL there.
>>
>> My 2 cents,
>> Paolo
>>
>> Md. Mizanur Rahoman wrote:
>>> Dear All,
>>>
>>> I want to collect some statistics over RDF data. My triple store is
>>> Virtuoso and I am using Jena for executing my queries. I want to get some
>>> statistics like:
>>> i) how many resources are in my dataset, ii) in which position of the
>>> dataset each resource appears (i.e., subject/predicate/object), etc.
>>> As my data is huge, I want to use Hadoop MapReduce for calculating such
>>> statistics.
>>>
>>> Can you please suggest an approach?
>>>
> 
> 
> 
