Re: Freebase Gridworks 1.0 released [Was: Nice Data Cleansing Tool Demo]

2010-05-10 Thread Kingsley Idehen

David Huynh wrote:

Hi all,

We're happy to announce that Freebase Gridworks 1.0 is now available 
for download, and it is also released as open source software:


Download, documentation, code, bugs:
http://code.google.com/p/freebase-gridworks/

Mailing list:
http://groups.google.com/group/freebase-gridworks

Gridworks is a power tool that allows you to load data, understand it, 
clean it up, reconcile it internally, augment it with data coming from 
Freebase, and optionally contribute your data to Freebase for others 
to use.


If you have seen the screencasts mentioned earlier [1], i.e.,

   Introduction: http://vimeo.com/10081183
   Faceting: http://vimeo.com/10287824

please know that there have been significant changes made to the 
software from the feedback of our alpha testers. The most important 
changes are the ability to add data from Freebase into your data sets, 
and the ability to load your data into Freebase (sandbox only for 
now). Data loads through Gridworks can be tracked here


http://gridworks-loads.freebaseapps.com/

Please try out Gridworks and join us on the mailing list mentioned 
above for discussion!


David

[1] 
http://lists.freebase.com/pipermail/freebase-discuss/2010-March/000860.html




On Mar/28/10 8:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that 
will only aid improving Web of Linked Data quality at the data set 
production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks





Wow!!

Great job David and Stefan!

--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Freebase Gridworks 1.0 released [Was: Nice Data Cleansing Tool Demo]

2010-05-10 Thread David Huynh

Hi all,

We're happy to announce that Freebase Gridworks 1.0 is now available for 
download, and it is also released as open source software:


Download, documentation, code, bugs:
http://code.google.com/p/freebase-gridworks/

Mailing list:
http://groups.google.com/group/freebase-gridworks

Gridworks is a power tool that allows you to load data, understand it, 
clean it up, reconcile it internally, augment it with data coming from 
Freebase, and optionally contribute your data to Freebase for others to use.


If you have seen the screencasts mentioned earlier [1], i.e.,

   Introduction: http://vimeo.com/10081183
   Faceting: http://vimeo.com/10287824

please know that there have been significant changes made to the 
software from the feedback of our alpha testers. The most important 
changes are the ability to add data from Freebase into your data sets, 
and the ability to load your data into Freebase (sandbox only for now). 
Data loads through Gridworks can be tracked here


http://gridworks-loads.freebaseapps.com/

Please try out Gridworks and join us on the mailing list mentioned above 
for discussion!


David

[1] 
http://lists.freebase.com/pipermail/freebase-discuss/2010-March/000860.html




On Mar/28/10 8:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that 
will only aid improving Web of Linked Data quality at the data set 
production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks





Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread David Huynh

Hi Aldo,

On Mar/30/10 1:46 am, Aldo Bucchi wrote:

Hi David,

I love it and I NEED it ;)
Awesome work, really.

I heard it will be opensource so I will probably be able to extend it
myself,

Yup, it'll be open source. Clean data sets are all clean the same way, 
but each dirty data set is dirty in its own way. Which is why Gridworks 
needs all the open source contributions in order to cover as many 
different kinds of data dirtiness as possible. :-)



but here are some ideas for (missing?) features:
* Importing custom Lookups/Dictionaries ( to go from text to IDs or
the other way around ). Maybe this is possible using a different hook
for the reconciliation mechanism.
* Related: Plug in other reconciliation services ( not sure how this
stands up to freebase biz alignment )
   
Definitely. Right now Gridworks is hooked up to 2 services: the Freebase 
text search service (called "relevance") and the experimental proper 
reconciliation service. It makes sense to be able to plug in other 
services as well.
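
As a rough illustration of the "plug in other services" idea, here is a minimal Python sketch, not Gridworks code. It treats every reconciliation service as a function from cell text to an identifier, so a custom lookup dictionary is just one more service. The names `Reconciler`, `dictionary_reconciler` and `reconcile_column`, and the sample IDs, are all made up for this example.

```python
from typing import Callable, Dict, List, Optional

# A reconciler maps a raw cell value to an identifier, or None if unmatched.
# (Hypothetical interface, not the Gridworks or Freebase API.)
Reconciler = Callable[[str], Optional[str]]


def dictionary_reconciler(lookup: Dict[str, str]) -> Reconciler:
    """Build a reconciler from a custom lookup table (text -> ID)."""
    # Normalize keys once so matching ignores case and surrounding whitespace.
    normalized = {key.strip().lower(): value for key, value in lookup.items()}

    def reconcile(text: str) -> Optional[str]:
        return normalized.get(text.strip().lower())

    return reconcile


def reconcile_column(cells: List[str], services: List[Reconciler]) -> List[Optional[str]]:
    """Try each service in order and keep the first match for every cell."""
    results: List[Optional[str]] = []
    for cell in cells:
        match = None
        for service in services:
            match = service(cell)
            if match is not None:
                break
        results.append(match)
    return results


# Illustrative identifiers only.
countries = dictionary_reconciler({"Chile": "/en/chile", "Germany": "/en/germany"})
matches = reconcile_column([" chile", "GERMANY", "Atlantis"], [countries])
print(matches)
```

Cells that no service recognizes stay `None`, which leaves them visible for manual clean-up.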



* Command line engine. To add a GW project as a step in a traditional
transformation job and execute steps sequentially.
   
We've thought of that, too, but haven't implemented it. That shouldn't 
be too hard.



* Expose Gazetteers ( dictionaries ) generated within the tool ( when
equating facets )
   

That makes sense. I'll think more about how to support that.

David




Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread David Huynh

On Mar/29/10 9:10 pm, François Scharffe wrote:

Hi David,

Great work !

When will the tool be released? I can't wait to try it.

Hi François, we're aiming for about 1 more month of development and testing.

David




Re: AW: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Kingsley Idehen

Peter Haase wrote:

Hi,

  

[SNIP]

<<
What is needed is Top-k plus the right pivot/refinement
operators (which link to new dynamic collections).




Yes, and I am sure you know that the above isn't in any way 
insurmountable (for Microsoft to implement) bearing in mind the server 
simply has to handle the URL requests it receives as part of its dynamic 
collection assembly process.


Pivot is a game changing client for Linked Data (RDF or OData variants), 
it simply makes life a lot easier for people to comprehend the virtues 
of Faceted Search & Find courtesy of EAV graph models.














Re: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Kingsley Idehen

Georgi Kobilarov wrote:

Kingsley,

  

So by the time you can
use Pivot on SW/linked data, you will already have solved all the
interesting and challenging problems.

  

This part is what I call an innovation slot since we have hooked it into our
DBMS hosted faceted engine and successfully used it over very large data
sets.


Kingsley, I'm wondering: How did you do that? I tried it myself, and
it doesn't work.
  

Did I indicate that my demo instance was public? How did you come to
overlook that?



I wasn't referring to a demo of yours, but to the general task of using
Pivot as a frontend to a faceted browsing backend engine. 
  

Re. the general task, it can complement a back-end.

Have you ever encountered an old concept, from the tabular data 
representation realm (e.g., RDBMS) called "Mirrored Cursors" ? Maybe 
you've encountered "Detached Rowsets" and schemes that also include 
delta handling between the client and the server.


The fundamental point I am making to you is simply this: Pivot is a 
powerful complement to an HTTP server that can deliver faceted 
navigation, natively (like Virtuoso). The end result is this: you can 
get the server to do some work (localize the first phase of the Faceted 
Search and Find against a massive data corpus) and then have the client 
handle the remainder (nice Visual UX for insight discovery).
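
The split described above can be sketched in a few lines of Python. This is an illustration only, not how Virtuoso or Pivot are implemented. The "server" function narrows a large corpus to a bounded slice, and the "client" function computes facet counts over the slice it received. The function names, the `limit` parameter and the sample records are assumptions for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List

Record = Dict[str, str]


def server_side_filter(corpus: Iterable[Record], facet: str, value: str, limit: int) -> List[Record]:
    """Stand-in for the server: narrow a large corpus to a bounded slice."""
    matches: List[Record] = []
    for record in corpus:
        if record.get(facet) == value:
            matches.append(record)
            if len(matches) == limit:
                break  # never ship more than the client can handle
    return matches


def client_side_facets(records: Iterable[Record], facet: str) -> Counter:
    """Stand-in for the client: count facet values over the received slice."""
    return Counter(record[facet] for record in records if facet in record)


corpus = [
    {"name": "A", "type": "Person", "country": "US"},
    {"name": "B", "type": "Person", "country": "DE"},
    {"name": "C", "type": "Company", "country": "US"},
    {"name": "D", "type": "Person", "country": "US"},
]

people = server_side_filter(corpus, "type", "Person", limit=100)
country_counts = client_side_facets(people, "country")
print(country_counts)
```

The client only ever sees the slice, so its facet counts describe the slice, not the whole corpus.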


  

Pivot can't make use of server-side faceted browsing engines.

  

Why do you speculate? You are incorrect and Virtuoso *doing* what you
claim is impossible will be emphatic proof, nice and simple.

Pivot consumes data from HTTP accessible collections (which may be static or
dynamic [1]). A dynamic collection is comprised of CXML resources (basically
XML) .



I don't speculate. Which parts of my "does not work" and "can't use" did
sound like a speculation?  

  
You explicitly said: "Pivot can't make use of server-side faceted 
browsing engines" .


I am saying, based on my earlier comments (clarified further above re. 
mirrored cursor anecdote): It can, will, and you shall see re. Virtuoso.


 
  

You need to send *all* the data to the Pivot client, and it computes
the facets and performs any filtering operation client-side.
  

You make a collection from a huge corpus of data (what I demonstrate) then
you "Save As" (which I demonstrate as the generation point re. CXML
resource) and then Pivot consumes. All the data is Virtuoso hosted.

There are two things you are overlooking:

1. The dynamic collection is produced at the conclusion of Virtuoso based
faceted navigation (the interactions basically describes the Facet
membership to Virtuoso)

2. Pivot works with static and dynamic collections .

*I specifically state, this is about using both products together to solve a
major problem. #1 Faceted Browsing UX #2 Faceting over a huge data
corpus.*

Virtuoso is an HTTP server, it can serve a myriad of representations of
data to user agents (it has its own DBMS hosted XSLT Processor and XML Schema
Validator with XQuery/XPath to boot, all very old stuff).



Yes, you make a collection and "save as" that to CXML, exactly! That is not
"using Pivot as a frontend to Virtuoso". 

I am starting from the Server not the Client.

I am starting from the Server because the Client can't handle the data 
corpus, and wasn't built with that in mind. It was built to consume a 
specific type of resource collection (static or dynamic) via HTTP, end of 
story.


Where I start from doesn't invalidate Pivot as a front-end to Virtuoso, 
the entire operation can take place within the  "Pivot Browser" (Pivot 
is an HTTP user agent that operates on a specific data representation 
format).

Sure, you can construct a small
dataset from a huge dataset using SPARQL, or your Virtuoso facet engine or
whatever. And then export that resulting dataset to Pivot collection XML and
load that CXML into Pivot. 

I am not talking about "Export" in the manner you characterize. I am 
talking about an HTTP conversation that results in CXML based resource 
being dispatched from a Server to a User Agent, REST-fully.
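
To make "a CXML based resource" concrete, here is a rough Python sketch that turns a handful of result rows into a collection document such as a server could return over HTTP. It is a simplified illustration, not Virtuoso's implementation. The element and attribute names follow my recollection of the CXML format and should be checked against the Pivot developer documentation; the Deep Zoom image attributes Pivot expects are omitted. The function name `rows_to_cxml`, the row keys `label` and `uri`, and the sample data are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Namespace as recalled from the CXML format; verify before relying on it.
CXML_NS = "http://schemas.microsoft.com/collection/metadata/2009"


def rows_to_cxml(name: str, rows: List[Dict[str, str]], facet_names: List[str]) -> str:
    """Serialize rows into a minimal CXML-like collection document."""
    collection = ET.Element(
        "Collection", {"xmlns": CXML_NS, "SchemaVersion": "1.0", "Name": name}
    )

    # Declare the facets the client may filter on.
    categories = ET.SubElement(collection, "FacetCategories")
    for facet in facet_names:
        ET.SubElement(categories, "FacetCategory", {"Name": facet, "Type": "String"})

    # One Item per row, with its facet values attached.
    items = ET.SubElement(collection, "Items")
    for index, row in enumerate(rows):
        item = ET.SubElement(
            items, "Item", {"Id": str(index), "Name": row["label"], "Href": row["uri"]}
        )
        facets = ET.SubElement(item, "Facets")
        for facet in facet_names:
            if facet in row:
                node = ET.SubElement(facets, "Facet", {"Name": facet})
                ET.SubElement(node, "String", {"Value": row[facet]})

    return ET.tostring(collection, encoding="unicode")


# Illustrative rows, e.g. the outcome of a server-side faceted query.
rows = [
    {"label": "Berlin", "uri": "http://example.org/Berlin", "country": "Germany"},
    {"label": "Santiago", "uri": "http://example.org/Santiago", "country": "Chile"},
]
document = rows_to_cxml("Cities", rows, ["country"])
print(document)
```

Whether the rows come from a SPARQL query or a facet engine makes no difference to the client; it only sees the document returned for the URL it requested.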




But that is very different to using Pivot as a
frontend to a huge data set. 
  

In your world view and eyes, maybe. Absolutely not the case in mine.

I can interact with Virtuoso from start to finish from within Pivot 
(never leaving Pivot). I start by making HTTP requests from Pivot, and 
the entire exercise concludes with a CXML representation of the 
collection assembled by Virtuoso (dynamically).




  

BTW -- how do you think Peter Haase got his variant working? I am sure he
will shed identical light on the matter for you.



Yes, Peter, please do. From what I saw in the Fluidops demo, it works
exactly as I wrote above: A sparql-query constructs a small dataset from the
sparql endpoint, converts that via a proxy to CXML and loads it into Pivot. 


I don't say Pivot d

AW: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Peter Haase
Hi,

> -Original Message-
> From: public-lod-requ...@w3.org [mailto:public-lod-requ...@w3.org] On
> Behalf Of Kingsley Idehen
> Sent: Monday, March 29, 2010 8:27 PM
> To: public-lod@w3.org; Georgi Kobilarov
> Subject: Contd: Nice Data Cleansing Tool Demo
> 
> Georgi Kobilarov wrote:
> > Hello,
> >
> >
> >>>> Now here is the obvious question, re. broader realm of faceted
> data
> >>>> navigation, have you guys digested the underlying concepts
> >>>> demonstrated by Microsoft Pivot?
> >>>>
> >>>>
> >>> I've seen the TED talk on Pivot. It's a very well polished
> >>> implementation of faceted browsing. The Seadragon technology
> >>> integration and animations are well executed. As far as "underlying
> >>> concepts" in faceted browsing go, I haven't noticed anything novel
> >>>
> > there.
> >
> > I agree with David here, nothing novel about the underlying concept.
> > One thing I found quite nice and haven't seen before is grouping
> results
> > along one facet dimension (the bar-graph representation of results).
> I
> > think
> > that is a neat idea.
> > The integration of Seadragon and deep-zooming looks nice, but little
> more
> > than that. Not all objects render into nice pictures, and the
> > interaction of zooming in
> > and out isn't a helpful one in my opinion. The zooming gives the
> > impression
> > at first that the position of objects in that 2D space is meaningful,
> > but it
> > is not.  It's an eye-catcher, not more.
> >
> >
> >
> >>> One thing to note: in each Pivot demo example, there is data of
> >>> exactly one type only--say, type people. So it seems, using
> Microsoft
> >>> Pivot, you can't pivot from one type to another, say, from people
> to
> >>> their companies. You can't do that example I used for Parallax: US
> >>> presidents -> children -> schools. Or skyscrapers -> architects ->
> >>> other buildings. So from what I've seen, as it currently is,
> Microsoft
> >>> Pivot cannot be used for browsing graphs because it cannot pivot
> (over
> >>> graph links).
> >>>
> >> Yes, this is a limitation re. general faceted browsing concepts.
> >>
> >
> > No, it's a limitation of the current implementations of faceted
> browsing.
> > Not a general problem with faceted browsing.
> >

Using dynamic collection you can essentially implement any pivot/query
refinement/filter operator you like, including the ones mentioned above.
It is true that the demo collections from Microsoft do not show this (yet),
but we have some of them in our system at
http://iwb.fluidops.com/pivot
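
The type-to-type pivot quoted above (US presidents -> children -> schools) can be sketched as a set-valued traversal over graph edges. This Python fragment is only an illustration of the operator, not the fluidOps or Parallax implementation; the `pivot` function and the tiny edge list are made up for the example.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject, property, object)


def pivot(current: Set[str], prop: str, edges: List[Edge]) -> Set[str]:
    """Move from the current set of nodes to everything linked via prop."""
    return {obj for subj, p, obj in edges if p == prop and subj in current}


# Illustrative data only.
edges = [
    ("president:1", "child", "person:a"),
    ("president:1", "child", "person:b"),
    ("president:2", "child", "person:c"),
    ("person:a", "school", "school:x"),
    ("person:b", "school", "school:y"),
    ("person:c", "school", "school:x"),
]

presidents = {"president:1", "president:2"}
children = pivot(presidents, "child", edges)
schools = pivot(children, "school", edges)
print(sorted(schools))
```

Each pivot yields a new collection of a different type, which is what a dynamic collection would have to generate on the server for every such step.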


> >
> >> The most interesting part to me is the use of an alternative symbol
> >> mechanism for the human interaction aspect i.e., deep zoom images
> where
> >> you would typically see a long human unfriendly URI.
> >>
> >
> > "Where you would typically see URIs"? Really?
> 
> **clean up post re. some critical typos **
> 
> Where would you see URIs? What do you see when you use:
> http://lod.openlinksw.com ?
> 
> And when you don't see URIs (human or machine, the typical case re.
> Faceted Browsing over RDF) what do you have re. HTTP based Linked Data?
> Zilch!
> >
> >
> >>> Furthermore, I believe that to get Pivot to perform well, you need
> a
> >>> cleaned up, *homogeneous* data set, presumably of small size (see
> >>> their Wikipedia example in which they picked only the top 500 most
> >>> visited articles). SW/linked data in their natural habitat,
> however,
> >>> is rarely that cleaned up and homogeneous ...

Yes, ideally you have clean homogeneous data. However, in our demonstrator
we do operate on a larger, un-cleaned LOD data set, including DBpedia (more
than 3 million entities) and several others (around 200 million triples in
total). Clearly, you see the problems in the data (missing images, wrong
images, duplicate values, ...). Still, I see it from a positive side: I
believe that for many information needs, visual exploration is a very
effective paradigm, and with a tool as good as Pivot one can achieve a
phenomenal user experience. And it is possible to show that with real LOD
data already today.
As Georgi said, the data quality will improve over time. Visual exploration
tools like Pivot - where you actually *see* the problems - might help on
this front.



> > Is  that really a problem of Linked Data Web a

Re: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Aldo Bucchi
Hi,

On Mon, Mar 29, 2010 at 3:22 PM, Nathan  wrote:
> Georgi Kobilarov wrote:
>> Kingsley,
>>
>> So by the time you can
>> use Pivot on SW/linked data, you will already have solved all the
>> interesting and challenging problems.
>>
> This part is what I call an innovation slot since we have hooked it
> into
>
 our

> DBMS hosted faceted engine and successfully used it over very large
>> data
> sets.
 Kingsley, I'm wondering: How did you do that? I tried it myself, and
 it doesn't work.
>>> Did I indicate that my demo instance was public? How did you come to
>>> overlook that?
>>
>> I wasn't referring to a demo of yours, but to the general task of using
>> Pivot as a frontend to a faceted browsing backend engine.
>>
>>
 Pivot can't make use of server-side faceted browsing engines.

>>> Why do you speculate? You are incorrect and Virtuoso *doing* what you
>>> claim is impossible will be emphatic proof, nice and simple.
>>>
>>> Pivot consumes data from HTTP accessible collections (which may be static
>> or
>>> dynamic [1]). A dynamic collection is comprised of CXML resources
>> (basically
>>> XML) .
>>
>> I don't speculate. Which parts of my "does not work" and "can't use" did
>> sound like a speculation?
>>
>>
 You need to send *all* the data to the Pivot client, and it computes
 the facets and performs any filtering operation client-side.
>>> You make a collection from a huge corpus of data (what I demonstrate) then
>>> you "Save As" (which I demonstrate as the generation point re. CXML
>>> resource) and then Pivot consumes. All the data is Virtuoso hosted.
>>>
>>> There are two things you are overlooking:
>>>
>>> 1. The dynamic collection is produced at the conclusion of Virtuoso based
>>> faceted navigation (the interactions basically describes the Facet
>>> membership to Virtuoso) 2. Pivot works with static and dynamic collections
>> .
>>> *I specifically state, this is about using both products together to solve
>> a
>>> major problem. #1 Faceted Browsing UX #2 Faceting over a huge data
>>> corpus.*
>>>
>>> Virtuoso is an HTTP server, it can serve a myriad of representations of
>> data to
>>> user agents (it has its own DBMS hosted XSLT Processor and XML Schema
>>> Validator with XQuery/XPath to boot, all very old stuff).
>>
>> Yes, you make a collection and "save as" that to CXML, exactly! That is not
>> "using Pivot as a frontend to Virtuoso". Sure, you can construct a small
>> dataset from a huge dataset using SPARQL, or your Virtuoso facet engine or
>> whatever. And then export that resulting dataset to Pivot collection XML and
>> load that CXML into Pivot. But that is very different to using Pivot as a
>> frontend to a huge data set.
>>
>>
>>> BTW -- how do you think Peter Haase got his variant working? I am sure he
>>> will shed identical light on the matter for you.
>>
>> Yes, Peter, please do. From what I saw in the Fluidops demo, it works
>> exactly as I wrote above: A sparql-query constructs a small dataset from the
>> sparql endpoint, converts that via a proxy to CXML and loads it into Pivot.
>>
>> I don't say Pivot doesn't make a nice demo, or a useful tool to explore a
>> small dataset via faceted filtering. But it's not a frontend that can be put
>> on top of a faceted browsing engine like
>> http://developer.nytimes.com/docs/article_search_api
>>
>
> The last thing I want is an argument about this; but surely virtually
> every service in the world; faceted browsing included, works by querying
> a large dataset to get a smaller set of results, transforming it into
> the needed format and then displaying it? Sounds like every system I've ever
> seen from the simple html view of an sql query right up to the mighty
> google itself.
>
> Maybe I'm being naive here; what am I missing?

Nathan,

You're not missing much. From what I see:
Georgi's point is that the level of integration is not ideal. It is
basically a "load" style integration, not a "connect" style
integration.
Kingsley's point is that they "can" be integrated, and he has a demo
to prove it.

Both are right ;)

I can relate to both but I lean towards Kingsley's because he is, as
usual, projecting. He knows that this integration is enough to make a
point, and that the rest will happen.
Show the value! The architecture will follow. ( this is what M$ does
all the time ). Plus they already have a lock-in on the runtime side
and seadragon tech, so I think they can afford to open the platform up
some more on the integration side of things.

Regards,
A

>
> Many Regards,
>
> Nathan
>
>



-- 
Aldo Bucchi
skype:aldo.bucchi
http://www.univrz.com/
http://aldobucchi.com/


Re: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Nathan
Georgi Kobilarov wrote:
> Kingsley,
> 
> So by the time you can
> use Pivot on SW/linked data, you will already have solved all the
> interesting and challenging problems.
>
 This part is what I call an innovation slot since we have hooked it
 into

>>> our
>>>
 DBMS hosted faceted engine and successfully used it over very large
> data
 sets.
>>> Kingsley, I'm wondering: How did you do that? I tried it myself, and
>>> it doesn't work.
>> Did I indicate that my demo instance was public? How did you come to
>> overlook that?
> 
> I wasn't referring to a demo of yours, but to the general task of using
> Pivot as a frontend to a faceted browsing backend engine. 
> 
> 
>>> Pivot can't make use of server-side faceted browsing engines.
>>>
>> Why do you speculate? You are incorrect and Virtuoso *doing* what you
>> claim is impossible will be emphatic proof, nice and simple.
>>
>> Pivot consumes data from HTTP accessible collections (which may be static
> or
>> dynamic [1]). A dynamic collection is comprised of CXML resources
> (basically
>> XML) .
> 
> I don't speculate. Which parts of my "does not work" and "can't use" did
> sound like a speculation?  
> 
>  
>>> You need to send *all* the data to the Pivot client, and it computes
>>> the facets and performs any filtering operation client-side.
>> You make a collection from a huge corpus of data (what I demonstrate) then
>> you "Save As" (which I demonstrate as the generation point re. CXML
>> resource) and then Pivot consumes. All the data is Virtuoso hosted.
>>
>> There are two things you are overlooking:
>>
>> 1. The dynamic collection is produced at the conclusion of Virtuoso based
>> faceted navigation (the interactions basically describes the Facet
>> membership to Virtuoso) 2. Pivot works with static and dynamic collections
> .
>> *I specifically state, this is about using both products together to solve
> a
>> major problem. #1 Faceted Browsing UX #2 Faceting over a huge data
>> corpus.*
>>
>> Virtuoso is an HTTP server, it can serve a myriad of representations of
> data to
>> user agents (it has its own DBMS hosted XSLT Processor and XML Schema
>> Validator with XQuery/XPath to boot, all very old stuff).
> 
> Yes, you make a collection and "save as" that to CXML, exactly! That is not
> "using Pivot as a frontend to Virtuoso". Sure, you can construct a small
> dataset from a huge dataset using SPARQL, or your Virtuoso facet engine or
> whatever. And then export that resulting dataset to Pivot collection XML and
> load that CXML into Pivot. But that is very different to using Pivot as a
> frontend to a huge data set. 
> 
> 
>> BTW -- how do you think Peter Haase got his variant working? I am sure he
>> will shed identical light on the matter for you.
> 
> Yes, Peter, please do. From what I saw in the Fluidops demo, it works
> exactly as I wrote above: A sparql-query constructs a small dataset from the
> sparql endpoint, converts that via a proxy to CXML and loads it into Pivot. 
> 
> I don't say Pivot doesn't make a nice demo, or a useful tool to explore a
> small dataset via faceted filtering. But it's not a frontend that can be put
> on top of a faceted browsing engine like
> http://developer.nytimes.com/docs/article_search_api
> 

The last thing I want is an argument about this; but surely virtually
every service in the world; faceted browsing included, works by querying
a large dataset to get a smaller set of results, transforming it into
the needed format and then displaying it? Sounds like every system I've ever
seen from the simple html view of an sql query right up to the mighty
google itself.

Maybe I'm being naive here; what am I missing?

Many Regards,

Nathan



RE: Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Georgi Kobilarov
Kingsley,

> >>> So by the time you can
> >>> use Pivot on SW/linked data, you will already have solved all the
> >>> interesting and challenging problems.
> >>>
> >> This part is what I call an innovation slot since we have hooked it
> >> into
> >>
> > our
> >
> >> DBMS hosted faceted engine and successfully used it over very large
data
> >> sets.
> >
> > Kingsley, I'm wondering: How did you do that? I tried it myself, and
> > it doesn't work.
> 
> Did I indicate that my demo instance was public? How did you come to
> overlook that?

I wasn't referring to a demo of yours, but to the general task of using
Pivot as a frontend to a faceted browsing backend engine. 


> > Pivot can't make use of server-side faceted browsing engines.
> >
> 
> Why do you speculate? You are incorrect and Virtuoso *doing* what you
> claim is impossible will be emphatic proof, nice and simple.
> 
> Pivot consumes data from HTTP accessible collections (which may be static
or
> dynamic [1]). A dynamic collection is comprised of CXML resources
(basically
> XML) .

I don't speculate. Which parts of my "does not work" and "can't use" did
sound like a speculation?  

 
> > You need to send *all* the data to the Pivot client, and it computes
> > the facets and performs any filtering operation client-side.
> 
> You make a collection from a huge corpus of data (what I demonstrate) then
> you "Save As" (which I demonstrate as the generation point re. CXML
> resource) and then Pivot consumes. All the data is Virtuoso hosted.
> 
> There are two things you are overlooking:
> 
> 1. The dynamic collection is produced at the conclusion of Virtuoso based
> faceted navigation (the interactions basically describes the Facet
> membership to Virtuoso) 2. Pivot works with static and dynamic collections
.
> 
> *I specifically state, this is about using both products together to solve
a
> major problem. #1 Faceted Browsing UX #2 Faceting over a huge data
> corpus.*
> 
> Virtuoso is an HTTP server, it can serve a myriad of representations of
data to
> user agents (it has its own DBMS hosted XSLT Processor and XML Schema
> Validator with XQuery/XPath to boot, all very old stuff).

Yes, you make a collection and "save as" that to CXML, exactly! That is not
"using Pivot as a frontend to Virtuoso". Sure, you can construct a small
dataset from a huge dataset using SPARQL, or your Virtuoso facet engine or
whatever. And then export that resulting dataset to Pivot collection XML and
load that CXML into Pivot. But that is very different to using Pivot as a
frontend to a huge data set. 


> BTW -- how do you think Peter Haase got his variant working? I am sure he
> will shed identical light on the matter for you.

Yes, Peter, please do. From what I saw in the Fluidops demo, it works
exactly as I wrote above: A sparql-query constructs a small dataset from the
sparql endpoint, converts that via a proxy to CXML and loads it into Pivot. 

I don't say Pivot doesn't make a nice demo, or a useful tool to explore a
small dataset via faceted filtering. But it's not a frontend that can be put
on top of a faceted browsing engine like
http://developer.nytimes.com/docs/article_search_api

Georgi

--
Georgi Kobilarov
Uberblic Labs Berlin
http://blog.georgikobilarov.com





Contd: Nice Data Cleansing Tool Demo

2010-03-29 Thread Kingsley Idehen

Georgi Kobilarov wrote:

Hello,

 

Now here is the obvious question, re. broader realm of faceted data
navigation, have you guys digested the underlying concepts
demonstrated by Microsoft Pivot?



I've seen the TED talk on Pivot. It's a very well polished
implementation of faceted browsing. The Seadragon technology
integration and animations are well executed. As far as "underlying
concepts" in faceted browsing go, I haven't noticed anything novel
there.

I agree with David here, nothing novel about the underlying concept.
One thing I found quite nice and haven't seen before is grouping results
along one facet dimension (the bar-graph representation of results). I think
that is a neat idea.
The integration of Seadragon and deep-zooming looks nice, but little more
than that. Not all objects render into nice pictures, and the interaction of
zooming in and out isn't a helpful one in my opinion. The zooming gives the
impression at first that the position of objects in that 2D space is
meaningful, but it is not.  It's an eye-catcher, not more.


 

One thing to note: in each Pivot demo example, there is data of
exactly one type only--say, type people. So it seems, using Microsoft
Pivot, you can't pivot from one type to another, say, from people to
their companies. You can't do that example I used for Parallax: US
presidents -> children -> schools. Or skyscrapers -> architects ->
other buildings. So from what I've seen, as it currently is, Microsoft
Pivot cannot be used for browsing graphs because it cannot pivot (over
graph links).
  

Yes, this is a limitation re. general faceted browsing concepts.



No, it's a limitation of the current implementations of faceted browsing.
Not a general problem with faceted browsing.


 

The most interesting part to me is the use of an alternative symbol
mechanism for the human interaction aspect i.e., deep zoom images where
you would typically see a long human unfriendly URI.



"Where you would typically see URIs"? Really? 


**clean up post re. some critical typos **

Where would you see URIs? What do you see when you use: 
http://lod.openlinksw.com ?


And when you don't see URIs (human or machine, the typical case re. 
Faceted Browsing over RDF) what do you have re. HTTP based Linked Data? 
Zilch!


 

Furthermore, I believe that to get Pivot to perform well, you need a
cleaned up, *homogeneous* data set, presumably of small size (see
their Wikipedia example in which they picked only the top 500 most
visited articles). SW/linked data in their natural habitat, however,
is rarely that cleaned up and homogeneous ...   


Is  that really a problem of Linked Data Web as such? I don't think so.
There is a lot of badly structured, not well cleaned up data on the current
Linked Data Web. Because there was so much excitement about publishing
anything in the early days, and so little attention to the actual data that's
getting published. That is going to change.
 

So by the time you can
use Pivot on SW/linked data, you will already have solved all the
interesting and challenging problems.
  

This part is what I call an innovation slot since we have hooked it into our
DBMS hosted faceted engine and successfully used it over very large data
sets. 


Kingsley, I'm wondering: How did you do that? I tried it myself, and it
doesn't work.


Did I indicate that my demo instance was public? How did you come to 
overlook that?



Pivot can't make use of server-side faceted browsing engines.
 


Why do you speculate? You are incorrect and Virtuoso *doing* what you 
claim is impossible will be emphatic proof, nice and simple.


Pivot consumes data from HTTP accessible collections (which may be 
static or dynamic [1]). A dynamic collection is comprised of CXML 
resources (basically XML) .



You need to send *all* the data to the Pivot client, and it computes the
facets and performs any filtering operation client-side. 


You make a collection from a huge corpus of data (what I demonstrate) 
then you "Save As" (which I demonstrate as the generation point re. CXML 
resource) and then Pivot consumes. All the data is Virtuoso hosted.


There are two things you are overlooking:

1. The dynamic collection is produced at the conclusion of Virtuoso 
based faceted navigation (the interactions basically describe the Facet 
membership to Virtuoso)

2. Pivot works with static and dynamic collections.

*I specifically state, this is about using both products together to 
solve a major problem. #1 Faceted Browsing UX #2 Faceting over a huge 
data corpus.*
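
A rough sketch of the flow described above, for readers who want something concrete: the facet selections are applied on the server over the full corpus, and only the matching subset is serialized for the Pivot client. Everything here is an assumption for illustration: the in-memory CORPUS stands in for Virtuoso-hosted data, facet_filter and to_collection_xml are hypothetical helper names, and the XML is a simplified, CXML-like shape (no namespace, no image references) that should be checked against the Pivot developer documentation before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large, server-hosted data corpus.
CORPUS = [
    {"id": "1", "name": "Alice", "type": "Person", "country": "UK"},
    {"id": "2", "name": "Bob", "type": "Person", "country": "US"},
    {"id": "3", "name": "Acme", "type": "Company", "country": "US"},
]


def facet_filter(corpus, selections):
    """Server-side step: keep only items matching every selected facet value."""
    return [
        item for item in corpus
        if all(item.get(facet) == value for facet, value in selections.items())
    ]


def to_collection_xml(name, items, facet_names):
    """Serialize the filtered subset as a simplified, CXML-like document.

    The element names are an approximation, not the validated CXML schema.
    """
    root = ET.Element("Collection", {"Name": name})
    items_el = ET.SubElement(root, "Items")
    for item in items:
        item_el = ET.SubElement(
            items_el, "Item", {"Id": item["id"], "Name": item["name"]}
        )
        facets_el = ET.SubElement(item_el, "Facets")
        for facet in facet_names:
            facet_el = ET.SubElement(facets_el, "Facet", {"Name": facet})
            ET.SubElement(facet_el, "String", {"Value": item[facet]})
    return ET.tostring(root, encoding="unicode")


# The facet selections play the role of the "interactions" that describe
# facet membership to the server; only the result is sent to the client.
subset = facet_filter(CORPUS, {"type": "Person", "country": "US"})
xml_text = to_collection_xml("US people", subset, ["type", "country"])
print(xml_text)
```

The point of the split is that the client only ever receives the subset selected by the faceted navigation, so the size of the full corpus does not bound what the client can display.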


Virtuoso is an HTTP server, it can serve a myriad of representations of 
data to user agents (it has its own DBMS hosted XSLT Processor and XML 
Schema Validator with XQuery/XPath to boot, all very old stuff).



BTW -- how do you think Peter Haase got his variant working? I am sure 
he will shed identical light on the matter for you.


Links:

1. http://www.getpivot.com/developer-info/ 

Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread Kingsley Idehen

Georgi Kobilarov wrote:

Hello,

  

Now here is the obvious question, re. broader realm of faceted data
navigation, have you guys digested the underlying concepts
demonstrated by Microsoft Pivot?



I've seen the TED talk on Pivot. It's a very well polished
implementation of faceted browsing. The Seadragon technology
integration and animations are well executed. As far as "underlying
concepts" in faceted browsing go, I haven't noticed anything novel
  

there.

I agree with David here, nothing novel about the underlying concept. 
One thing I found quite nice and haven't seen before is grouping results
along one facet dimension (the bar-graph representation of results). I think
that is a neat idea.


The integration of Seadragon and deep-zooming looks nice, but little more
than that. 
Not all objects render into nice pictures, and the interaction of zooming in
and out isn't a helpful one in my opinion. The zooming gives the impression
at first that the position of objects in that 2D space is meaningful, but it
is not.
It's an eye-catcher, not more.



  

One thing to note: in each Pivot demo example, there is data of
exactly one type only--say, type people. So it seems, using Microsoft
Pivot, you can't pivot from one type to another, say, from people to
their companies. You can't do that example I used for Parallax: US
presidents -> children -> schools. Or skyscrapers -> architects ->
other buildings. So from what I've seen, as it currently is, Microsoft
Pivot cannot be used for browsing graphs because it cannot pivot (over
graph links).
  

Yes, this is a limitation re. general faceted browsing concepts.



No, it's a limitation of the current implementations of faceted browsing.
Not a general problem with faceted browsing.


  

The most interesting part to me is the use of an alternative symbol
mechanism for the human interaction aspect i.e., deep zoom images where
you would typically see a long human unfriendly URI.



"Where you would typically see URIs"? Really? 
  
Where would you see URIs? What do you see when you use: 
http://lod.openlinksw.com ?


And when you don't see URIs (human or machine, the typical case re. 
Faceted Browsing over RDF) what do you have re. HTTP based Linked Data? 
Zilch!


  

Furthermore, I believe that to get Pivot to perform well, you need a
cleaned up, *homogeneous* data set, presumably of small size (see
their Wikipedia example in which they picked only the top 500 most
visited articles). SW/linked data in their natural habitat, however,
is rarely that cleaned up and homogeneous ... 
  


Is that really a problem of the Linked Data Web as such? I don't think so.
There is a lot of badly structured, not well cleaned up data on the current
Linked Data Web, because there was so much excitement about publishing
anything in the early days, and so little attention to the actual data that's
getting published. That is going to change.

  

So by the time you can
use Pivot on SW/linked data, you will already have solved all the
interesting and challenging problems.
  

This part is what I call an innovation slot since we have hooked it into our
DBMS hosted faceted engine and successfully used it over very large data
sets.



Kingsley, I'm wondering: How did you do that? I tried it myself, and it
doesn't work. 
Did I indicate that my demo instance was public? How did you come to 
overlook that?

Pivot can't make use of server-side faceted browsing engines.
  
Why do you speculate? You are incorrect and Virtuoso doing what you claim 
is impossible will be emphatic proof, nice and simple.


Pivot consumes data from HTTP accessible collections (which may be 
static or dynamic [1]). A dynamic collection is composed of CXML 
resources (basically XML).

You need to send *all* the data to the Pivot client, and it computes the
facets and performs any filtering operation client-side. 


You make a collection from a huge corpus of data (what I demonstrate) 
then you "Save As" (which I demonstrate as the generation point re. CXML 
resource) and then Pivot consumes. All the data is Virtuoso hosted.


There are two things you are overlooking:

1. The dynamic collection is produced at the conclusion of Virtuoso 
based faceted navigation (the interactions basically describe the Facet 
membership to Virtuoso)

2. Pivot works with static and dynamic collections

Virtuoso is an HTTP server, it can serve a myriad of  representations of 
data to user agents (it has its own DBMS hosted XSLT Processor and XML 
Schema Validator with XQuery/XPath to boot, all very old stuff).



BTW -- how do you think Peter Haase got his variant working? I am sure 
he will shed identical light on the matter for you.


Links:

1. http://www.getpivot.com/developer-info/ --- Please note Unbounded 
Dynamic Collections
2. http://www.getpivot.com/developer-info/hosting.aspx#Dynamic -- Look 
at the diagram then revisit the architecture of Virtuoso (it's a Hybrid 
Data Server that

Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread Aldo Bucchi
Hi David,

I love it and I NEED it ;)
Awesome work, really.

I heard it will be opensource so I will probably be able to extend it
myself, but here are some ideas for (missing?) features:
* Importing custom Lookups/Dictionaries ( to go from text to IDs or
the other way around ). Maybe this is possible using a different hook
for the reconciliation mechanism.
* Related: Plug in other reconciliation services ( not sure how this
stands up to freebase biz alignment )
* Command line engine. To add a GW project as a step in a traditional
transformation job and execute steps sequentially.
* Expose Gazetteers ( dictionaries ) generated within the tool ( when
equating facets )

I have other ideas but I need to try it first; it looks like you've
covered a lot of ground here.

Amazing, Amazing. Thanks!
A


On Sun, Mar 28, 2010 at 8:06 PM, David Huynh  wrote:
> On Mar/29/10 12:31 am, Kingsley Idehen wrote:
>
> All,
>
> A very nice data cleansing tool from David and Co. at Freebase.
>
> CSVs are clearly the dominant data format in the structured open data realm.
> This tool deals with ETL very well. Of course, for those who appreciate OWL,
> a lot of what's demonstrated in this demo is also achievable via "context
> rules". Bottom line (imho), nice tool that will only aid improving Web of
> Linked Data quality at the data set production stage.
>
> Links:
>
> 1. http://vimeo.com/10081183 -- Freebase Gridworks
>
> Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also
> demonstrates a few other interesting features:
>
>     http://www.vimeo.com/10287824
>
> David
>



-- 
Aldo Bucchi
skype:aldo.bucchi
http://www.univrz.com/
http://aldobucchi.com/




RE: Nice Data Cleansing Tool Demo

2010-03-29 Thread Georgi Kobilarov
Hello,

> >> Now here is the obvious question, re. broader realm of faceted data
> >> navigation, have you guys digested the underlying concepts
> >> demonstrated by Microsoft Pivot?
> >>
> >
> > I've seen the TED talk on Pivot. It's a very well polished
> > implementation of faceted browsing. The Seadragon technology
> > integration and animations are well executed. As far as "underlying
> > concepts" in faceted browsing go, I haven't noticed anything novel
> > there.

I agree with David here, nothing novel about the underlying concept. 
One thing I found quite nice and haven't seen before is grouping results
along one facet dimension (the bar-graph representation of results). I think
that is a neat idea. 

The integration of Seadragon and deep-zooming looks nice, but little more
than that. 
Not all objects render into nice pictures, and the interaction of zooming in
and out isn't a helpful one in my opinion. The zooming gives the impression
at first that the position of objects in that 2D space is meaningful, but it
is not.  
It's an eye-catcher, not more.


> > One thing to note: in each Pivot demo example, there is data of
> > exactly one type only--say, type people. So it seems, using Microsoft
> > Pivot, you can't pivot from one type to another, say, from people to
> > their companies. You can't do that example I used for Parallax: US
> > presidents -> children -> schools. Or skyscrapers -> architects ->
> > other buildings. So from what I've seen, as it currently is, Microsoft
> > Pivot cannot be used for browsing graphs because it cannot pivot (over
> > graph links).
> Yes, this is a limitation re. general faceted browsing concepts.

No, it's a limitation of the current implementations of faceted browsing.
Not a general problem with faceted browsing.


> The most interesting part to me is the use of an alternative symbol
> mechanism for the human interaction aspect i.e., deep zoom images where
> you would typically see a long human unfriendly URI.

"Where you would typically see URIs"? Really? 


> > Furthermore, I believe that to get Pivot to perform well, you need a
> > cleaned up, *homogeneous* data set, presumably of small size (see
> > their Wikipedia example in which they picked only the top 500 most
> > visited articles). SW/linked data in their natural habitat, however,
> > is rarely that cleaned up and homogeneous ... 

Is that really a problem of the Linked Data Web as such? I don't think so.
There is a lot of badly structured, not well cleaned up data on the current
Linked Data Web, because there was so much excitement about publishing
anything in the early days, and so little attention to the actual data that's
getting published. That is going to change.

> > So by the time you can
> > use Pivot on SW/linked data, you will already have solved all the
> > interesting and challenging problems.
> This part is what I call an innovation slot since we have hooked it into
> our DBMS hosted faceted engine and successfully used it over very large
> data sets.

Kingsley, I'm wondering: How did you do that? I tried it myself, and it
doesn't work. Pivot can't make use of server-side faceted browsing engines.
You need to send *all* the data to the Pivot client, and it computes the
facets and performs any filtering operation client-side. Works well for up
to around 1k objects, but that's it. Pivot's architecture is in that sense
very much like Exhibit in Silverlight.
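
To make the client-side model concrete, here is a minimal sketch, assuming items are plain dictionaries already downloaded to the client. The function names and sample data are hypothetical and are not part of Pivot or Exhibit; the sketch only illustrates why every item has to be present locally before facet counts or filters can be computed.

```python
from collections import Counter


def facet_counts(items, facet):
    """Count how many of the locally held items carry each value of a facet."""
    return Counter(item[facet] for item in items if facet in item)


def apply_filter(items, facet, value):
    """Keep only the locally held items whose facet equals the chosen value."""
    return [item for item in items if item.get(facet) == value]


# Hypothetical items; in the client-side model the whole list is
# transferred up front, before any faceting happens.
items = [
    {"name": "Alice", "type": "Person", "country": "UK"},
    {"name": "Bob", "type": "Person", "country": "US"},
    {"name": "Acme", "type": "Company", "country": "US"},
]

counts = facet_counts(items, "country")
us_items = apply_filter(items, "country", "US")
print(counts)
print([item["name"] for item in us_items])
```

Both functions scan the full item list, so memory use and response time grow with the size of the collection held by the client.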


Best,
Georgi

--
Georgi Kobilarov
Uberblic Labs Berlin
http://blog.georgikobilarov.com





Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread François Scharffe

Hi David,

Great work !

When will the tool be released? I can't wait to try it.

Cheers,
François

David Huynh wrote:

On Mar/29/10 12:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that 
will only aid improving Web of Linked Data quality at the data set 
production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also 
demonstrates a few other interesting features:


http://www.vimeo.com/10287824

David





Re: Nice Data Cleansing Tool Demo

2010-03-29 Thread Kingsley Idehen

David Huynh wrote:

On Mar/29/10 10:01 am, Kingsley Idehen wrote:

David Huynh wrote:

On Mar/29/10 12:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open 
data realm. This tool deals with ETL very well. Of course, for 
those who appreciate OWL, a lot of what's demonstrated in this demo 
is also achievable via "context rules". Bottom line (imho), nice 
tool that will only aid improving Web of Linked Data quality at the 
data set production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also 
demonstrates a few other interesting features:


http://www.vimeo.com/10287824

David

David,

Yes, very nice!

Now here is the obvious question, re. broader realm of faceted data 
navigation, have you guys digested the underlying concepts 
demonstrated by Microsoft Pivot?




I've seen the TED talk on Pivot. It's a very well polished 
implementation of faceted browsing. The Seadragon technology 
integration and animations are well executed. As far as "underlying 
concepts" in faceted browsing go, I haven't noticed anything novel there.


One thing to note: in each Pivot demo example, there is data of 
exactly one type only--say, type people. So it seems, using Microsoft 
Pivot, you can't pivot from one type to another, say, from people to 
their companies. You can't do that example I used for Parallax: US 
presidents -> children -> schools. Or skyscrapers -> architects -> 
other buildings. So from what I've seen, as it currently is, Microsoft 
Pivot cannot be used for browsing graphs because it cannot pivot (over 
graph links).

Yes, this is a limitation re. general faceted browsing concepts.


The most interesting part to me is the use of an alternative symbol 
mechanism for the human interaction aspect i.e., deep zoom images where 
you would typically see a long human unfriendly URI.


Furthermore, I believe that to get Pivot to perform well, you need a 
cleaned up, *homogeneous* data set, presumably of small size (see 
their Wikipedia example in which they picked only the top 500 most 
visited articles). SW/linked data in their natural habitat, however, 
is rarely that cleaned up and homogeneous ... So by the time you can 
use Pivot on SW/linked data, you will already have solved all the 
interesting and challenging problems.

This part is what I call an innovation slot since we have hooked it into 
our DBMS hosted faceted engine and successfully used it over very large 
data sets. Of course it means that we've implemented some internal tweaks 
re. the alternative identifier symbols, but once that was done, it was 
back to letting our engine do its thing re. huge data set navigation and 
the ability to expose Entity-Attribute-Value graph model based 
hypermedia resources in a variety of data representations (functionality 
that lies at the very core of Virtuoso), etc.


I do applaud their recent offering of the Pivot widget for embedding 
into any arbitrary site. That should make faceted browsing more 
accessible to web authors, as Exhibit has done. Pivot is way more 
polished and hopefully scales better than Exhibit, although Exhibit is 
more malleable as a piece of software.

Nice assessment :-)

We will soon unveil versions of our live instances (LOD Cloud Cache, 
DBpedia, etc.) that work with Pivot as the client via dynamic 
collections. There is a fundamental feature in Virtuoso (what we call 
Anytime Query) that is essential to delivering this functionality. It is 
my hope that via Pivot (for which dynamic collections are extremely 
challenging) we can make comprehension a little clearer. What I describe 
is a general DBMS engine tweak (it goes beyond RDF data management).


Links:

1. http://www.youtube.com/watch?v=G29DBIEcIuQ -- a quick and dirty 
screencast I published post confirmation that our goals had been 
achieved re. huge RDF data sets navigation via Pivot


2. http://bit.ly/9mj7Fw -- old presentation covering our DBMS hosted 
faceted browser engine + Anytime Query feature for handling huge data 
sets at Web scale.



Kingsley



David





--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: Nice Data Cleansing Tool Demo

2010-03-28 Thread David Huynh

On Mar/29/10 10:01 am, Kingsley Idehen wrote:

David Huynh wrote:

On Mar/29/10 12:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open 
data realm. This tool deals with ETL very well. Of course, for those 
who appreciate OWL, a lot of what's demonstrated in this demo is 
also achievable via "context rules". Bottom line (imho), nice tool 
that will only aid improving Web of Linked Data quality at the data 
set production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also 
demonstrates a few other interesting features:


http://www.vimeo.com/10287824

David

David,

Yes, very nice!

Now here is the obvious question, re. broader realm of faceted data 
navigation, have you guys digested the underlying concepts 
demonstrated by Microsoft Pivot?




I've seen the TED talk on Pivot. It's a very well polished 
implementation of faceted browsing. The Seadragon technology integration 
and animations are well executed. As far as "underlying concepts" in 
faceted browsing go, I haven't noticed anything novel there.


One thing to note: in each Pivot demo example, there is data of exactly 
one type only--say, type people. So it seems, using Microsoft Pivot, you 
can't pivot from one type to another, say, from people to their 
companies. You can't do that example I used for Parallax: US presidents 
-> children -> schools. Or skyscrapers -> architects -> other buildings. 
So from what I've seen, as it currently is, Microsoft Pivot cannot be 
used for browsing graphs because it cannot pivot (over graph links).


Furthermore, I believe that to get Pivot to perform well, you need a 
cleaned up, *homogeneous* data set, presumably of small size (see their 
Wikipedia example in which they picked only the top 500 most visited 
articles). SW/linked data in their natural habitat, however, is rarely 
that cleaned up and homogeneous ... So by the time you can use Pivot on 
SW/linked data, you will already have solved all the interesting and 
challenging problems.


I do applaud their recent offering of the Pivot widget for embedding 
into any arbitrary site. That should make faceted browsing more 
accessible to web authors, as Exhibit has done. Pivot is way more 
polished and hopefully scales better than Exhibit, although Exhibit is 
more malleable as a piece of software.


David




Re: Nice Data Cleansing Tool Demo

2010-03-28 Thread Kingsley Idehen

David Huynh wrote:

On Mar/29/10 12:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that 
will only aid improving Web of Linked Data quality at the data set 
production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also 
demonstrates a few other interesting features:


http://www.vimeo.com/10287824

David

David,

Yes, very nice!

Now here is the obvious question, re. broader realm of faceted data 
navigation, have you guys digested the underlying concepts demonstrated 
by Microsoft Pivot?


--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: Nice Data Cleansing Tool Demo

2010-03-28 Thread David Huynh

On Mar/29/10 12:31 am, Kingsley Idehen wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that 
will only aid improving Web of Linked Data quality at the data set 
production stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

Thanks, Kingsley. The second screencast, by Stefano Mazzocchi, also 
demonstrates a few other interesting features:


http://www.vimeo.com/10287824

David


Re: [uk-government-data-developers] Nice Data Cleansing Tool Demo

2010-03-28 Thread Kingsley Idehen

Leigh Dodds wrote:

Hi,

On Sunday, March 28, 2010, Kingsley Idehen  wrote:
  

All,

A very nice data cleansing tool from David and Co. at Freebase.



Yes, it looks very nice. Am looking forward to working with it.

  

CSVs are clearly the dominant data format in the structured open data
realm. This tool deals with ETL very well. Of course, for those who
appreciate OWL, a lot of what's demonstrated in this demo is also
achievable via "context rules".



Can you (or others) expand on that?

Much of the power in the demo seemed to me to be in the facetting,
scripting of cleansing, analysis of value spaces, etc.

I'd be interested to know how OWL could be applied here.

Cheers,

L.

  

Leigh,

OWL comes in post load of the data into the Quad Store (clean or dirty). 
Note, this demo is based on Literal values cleansing. When you have data 
object identifiers in play you aren't confined to joining data via 
Literal Values (key difference between RDBMS realm and RDF and other 
Graph Model realms).


1. Co-reference - via owl:sameAs assertions
2. Dirty Data - use of procedure functions and inverse functional 
properties
3. Units of Measurement - leveraging locale prowess of HTTP re. ability 
to identify locale of user agents combined with TCN QoS algorithms 
(which can be part of SPARQL as we've done re. Virtuoso)


You can make rules that incorporate all of the above, you can even do so 
with SPARQL (plus function/magic predicates) as the Rules Language for 
constrained forward-chaining in more extreme cases.
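
As a hedged illustration of items 1 and 2 above, the sketch below groups identifiers that refer to the same entity, either through explicit owl:sameAs assertions or because they share a value for an inverse functional property (foaf:mbox is used as the example). It is plain Python over hypothetical triples, using a union-find structure; it does not show how Virtuoso or SPARQL evaluates such rules, and the ex: identifiers are made up.

```python
def find(parent, node):
    """Return the representative of node's group, registering new nodes."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]  # path halving
        node = parent[node]
    return node


def union(parent, a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def coreference_groups(triples, inverse_functional):
    """Group subjects linked by owl:sameAs or by a shared value of an
    inverse functional property."""
    parent = {}
    first_subject_for = {}  # (property, value) -> first subject seen
    for subject, prop, obj in triples:
        find(parent, subject)
        if prop == "owl:sameAs":
            union(parent, subject, obj)
        elif prop in inverse_functional:
            key = (prop, obj)
            if key in first_subject_for:
                union(parent, first_subject_for[key], subject)
            else:
                first_subject_for[key] = subject
    groups = {}
    for node in list(parent):
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical data: ex:a and ex:b are asserted equal; ex:b and ex:c
# share a mailbox, which is an inverse functional property in FOAF.
triples = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:b", "foaf:mbox", "mailto:x@example.org"),
    ("ex:c", "foaf:mbox", "mailto:x@example.org"),
    ("ex:d", "foaf:name", "Dee"),
]

groups = coreference_groups(triples, {"foaf:mbox"})
print(groups)
```

The key difference from joining on literal values is visible in the data: ex:a and ex:c never share a literal directly, yet they end up in the same group through the chain of identifier-level assertions.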


I can load a dirty CSV file into Virtuoso, and leverage OWL, SPARQL, 
Function/Magic Predicates en route to handling:


1. Semantic Disparity
2. Structural Disparity
3. Entity Co-References.

Naturally, someone could, and eventually would, write a data 
reconciliation tool that looked like Microsoft Access and basically 
delivered on the above, while simply riding Virtuoso engines 
(ditto any other Quad Store with similar capabilities). It's all going to 
happen quicker than most will expect, especially now that OData is part 
of the mix re. granular structured linked data, and the universal nature 
of the Entity-Attribute-Value model is getting clearer to broader 
audiences by the second :-)


Links:

1. http://bit.ly/csFCqC -- Data Reconciliation using TimBL as subject 
(note the co-reference and indirect-coreference tab data which offers a 
teaser).


--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: [uk-government-data-developers] Nice Data Cleansing Tool Demo

2010-03-28 Thread Leigh Dodds
Hi,

On Sunday, March 28, 2010, Kingsley Idehen  wrote:
> All,
>
> A very nice data cleansing tool from David and Co. at Freebase.

Yes, it looks very nice. Am looking forward to working with it.

> CSVs are clearly the dominant data format in the structured open data
> realm. This tool deals with ETL very well. Of course, for those who
> appreciate OWL, a lot of what's demonstrated in this demo is also
> achievable via "context rules".

Can you (or others) expand on that?

Much of the power in the demo seemed to me to be in the facetting,
scripting of cleansing, analysis of value spaces, etc.

I'd be interested to know how OWL could be applied here.

Cheers,

L.

-- 
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.do...@talis.com
http://www.talis.com



Nice Data Cleansing Tool Demo

2010-03-28 Thread Kingsley Idehen

All,

A very nice data cleansing tool from David and Co. at Freebase.

CSVs are clearly the dominant data format in the structured open data 
realm. This tool deals with ETL very well. Of course, for those who 
appreciate OWL, a lot of what's demonstrated in this demo is also 
achievable via "context rules". Bottom line (imho), nice tool that will 
only aid improving Web of Linked Data quality at the data set production 
stage.


Links:

1. http://vimeo.com/10081183 -- Freebase Gridworks

--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen