Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

2018-08-15 Thread Stas Malyshev
Hi!

> This is a bit tangential to the topic, but isn’t that basically what
> schema.org was developed for? (I’m not sure if that’s still its primary
> purpose, but as far as I know it was started by a group of search
> engines to develop a unified format websites could use to make their
> semantics more accessible to those search engines.)

There are a number of schemas, like Dublin Core, that try to address
issues like that. However, none is even close to what we're talking
about - covering several thousand properties that change all the time.
They have very basic things covered, but AFAIK not much beyond. And I
think those vocabularies still do not solve our problem with updating
labels in multiple languages and keeping them in sync.

That said, this would be quite offtopic for *this* thread, but still if
anybody has any ideas on how to present Wikidata content better to
search engines using well-known metadata vocabularies, I think it would
be a very welcome effort.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

2018-08-15 Thread Nikola Smolenski
On Wed, Aug 15, 2018 at 7:20 AM, Stas Malyshev wrote:

> > a Qnumber or Pnumber in there (extra incentive for people to add labels
> > in their language). Probably also everything duplicated in the text
>
> That presents a problem. While you see "instance of": "human", the data
> is P31:Q5. We can, of course, put "instance of": "human" in the index.
> But what if label for Q5 changes? Now we have to re-index 10 million
> records.
>

I haven't thought this through, but would it be possible to index just Q5,
and then, when someone searches for "human", look up all the items
with the label "human", so that the search becomes "human OR Q5"?
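
A minimal sketch of that query-expansion idea, assuming a separate lookup
table from label text to item IDs. The table LABEL_TO_IDS and the function
expand_query are hypothetical names for illustration, not part of any
existing Wikidata or ElasticSearch API:

```python
# Hypothetical lookup table: label text -> IDs of the items carrying that label.
# In this scheme only this table changes when a label is edited; the indexed
# documents keep storing the bare ID (e.g. Q5).
LABEL_TO_IDS = {
    "human": ["Q5"],
}


def expand_query(term: str) -> str:
    """Return the search term OR-ed with the IDs of all items labelled with it."""
    ids = LABEL_TO_IDS.get(term.lower(), [])
    return " OR ".join([term, *ids])


print(expand_query("human"))    # human OR Q5
print(expand_query("unknown"))  # unknown
```

The point of the design is that a label edit touches one table entry
instead of every document that links to the item.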


Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

2018-08-15 Thread Gerard Meijssen
Hoi,
May I remind you all that as it is, particularly the "descriptions" are
really problematic. They are often created based on Wikipedia categories
and it is quite rare that they get updated. Compare this with the
"automated descriptions" that have been around for years.

When new properties are added to an item, the automated description may
change as a result, and this is reflected in any language. These
changed descriptions may be stored until the next update on the item, they
may be generated when needed and obviously they may be cached. They may be
used in the build up of a search and this will be a much bigger incentive
for people to update labels.
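
A rough sketch of how such a description could be generated from
statements on demand. This is not Reasonator's actual logic; the label
table, the choice of properties (P106 "occupation", P31 "instance of") and
the function describe are assumptions made for illustration:

```python
# Hypothetical label table: entity ID -> language code -> label.
LABELS = {
    "Q5": {"en": "human", "nl": "mens"},
    "Q33999": {"en": "actor"},  # no Dutch label yet
}


def describe(statements: dict, lang: str) -> str:
    """Build a short description from statement values in the given language.

    A value without a label in that language is shown as its raw ID, which
    makes the missing label visible to whoever reads the description.
    """
    parts = []
    for prop in ("P106", "P31"):  # occupation first, then instance of
        for value in statements.get(prop, []):
            parts.append(LABELS.get(value, {}).get(lang, value))
    return ", ".join(parts)


item = {"P31": ["Q5"], "P106": ["Q33999"]}
print(describe(item, "en"))  # actor, human
print(describe(item, "nl"))  # Q33999, mens
```

Adding the missing Dutch label to the table changes the Dutch
description of every item that uses it, without touching those items.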

Contrary to what some think, labels are updated based on a "need", and this
need is hardly there because Wikidata only appeals to geeks. That is why the
Reasonator approach to labelisation makes so much sense: you see the
missing labels, you add them, and the next item will show the new labels.
Given that people work in domains, it is a sound approach, and it will
also quite quickly improve the quality of "automated descriptions" in any
language.

Did I tell you that I disambiguate items by adding labels and properties in
Wikidata? In Reasonator, when you refresh a "search" you will see, for
instance, a date of birth or death added, making John Smith *that* John Smith.

Obviously, search could be a lot better and using "automated descriptions"
will make a positive difference.
Thanks,
   GerardM


Re: [Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

2018-08-14 Thread Stas Malyshev
Hi!

> https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently
> our query service is a very strong and complete service, but Wikidata
> search is very poor. Let's take Blade Runner.

I don't think it's *very* poor anymore, but it certainly can be better.

> In my ideal world, everything I see as a human gets indexed into the
> search engine preferably in a per language index. For example for Dutch

Err... The problem is that what you see as a human and what a search
engine uses for lookups are very different things. While for text
articles it is similar, for structured data it's quite different, and
treating structured data the same way as text is not going to produce
good results, partially because most search algorithms make assumptions
that come from text world, partially because we'd be ignoring useful
clues present in structured data.

> something like a text_nl field with the label, description, aliases,
> statements and references in there. So index *everything* and never see

There are such fields, but it makes no sense to put references there,
because there's no such thing as "Dutch reference". References do not
change with language.

> a Qnumber or Pnumber in there (extra incentive for people to add labels
> in their language). Probably also everything duplicated in the text

That presents a problem. While you see "instance of": "human", the data
is P31:Q5. We can, of course, put "instance of": "human" in the index.
But what if the label for Q5 changes? Now we have to re-index 10 million
records. And while we're doing it, what if the label of another such item
changes? We'd have to start another million-size reindex. In a
week, we'd have a backlog of hopeless size, or we'd require processing
power that we just don't have. Note also that ElasticSearch doesn't
really do document updates - it just writes a new document. So frequent
updates to the same document are not its optimal scenario, and we're
talking about propagating each label edit to each item that is linked to
that one. I'm afraid that would explode on us very quickly.

The problem is not indexing labels, the problem is keeping them
up-to-date on 50 million interlinked items.
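
To make the fan-out concrete, here is a toy sketch assuming labels are
copied into every document that links to an item. The item IDs Q1001 to
Q1003 are placeholders and the function names are hypothetical; only the
shape of the problem is meant to be accurate:

```python
from collections import defaultdict

# Toy data: item ID -> IDs it links to through its statements.
# Q1001..Q1003 are placeholder IDs, not real items.
ITEM_LINKS = {
    "Q1001": ["Q5"],
    "Q1002": ["Q5"],
    "Q1003": ["Q11424"],
}


def build_reverse_links(item_links: dict) -> dict:
    """Map each linked-to ID to the set of items that reference it."""
    reverse = defaultdict(set)
    for item, targets in item_links.items():
        for target in targets:
            reverse[target].add(item)
    return reverse


def items_to_reindex(changed_id: str, reverse: dict) -> list:
    """Items whose denormalized text embeds the label of changed_id."""
    return sorted(reverse.get(changed_id, set()))


reverse = build_reverse_links(ITEM_LINKS)
print(items_to_reindex("Q5", reverse))  # ['Q1001', 'Q1002']
```

With real data the list returned for Q5 would hold millions of items, and
each one would have to be written to the index again.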

When displaying, it's easy - you don't need to worry until you show it,
and most items are shown only rarely. Even then you see a label out of
date now and then. But with search, you can't update a label on use - when
you want to use it (i.e. look it up), it should already be up-to-date,
otherwise it's useless.
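
For contrast, a sketch of the display-side approach: resolve a label only
when an item is shown, and keep it in a small cache. The class LabelCache
and the fetch callback are hypothetical and stand in for whatever the
display code really does:

```python
class LabelCache:
    """Resolve labels lazily at display time and remember the result."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: item ID -> current label
        self._cache = {}

    def label(self, item_id: str) -> str:
        # Look the label up only the first time the item is displayed.
        if item_id not in self._cache:
            self._cache[item_id] = self._fetch(item_id)
        return self._cache[item_id]

    def invalidate(self, item_id: str) -> None:
        # A label edit drops one cache entry; no other item is touched.
        self._cache.pop(item_id, None)


store = {"Q5": "human"}
cache = LabelCache(store.__getitem__)
print(cache.label("Q5"))      # human
store["Q5"] = "human being"
print(cache.label("Q5"))      # human (stale until invalidated)
cache.invalidate("Q5")
print(cache.label("Q5"))      # human being
```

A search index has no equivalent of "resolve on display": the label has
to be inside the indexed document before the query arrives.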

> As for implementation: We already have the logic to serialize our json
> to the RDF format. Maybe also add a serialization format for this that
> is easy to ingest by search engines? 

I don't know any such special format, do you? We of course have JSON
updates to ElasticSearch, but as I noted before, updates are the problem
there, not format. RDF of course also does not carry denormalized data,
so there we update only the entries that need updating, and fetch labels on
use. We cannot do that for a search index. I don't think format here is the
problem.

> . Making it easier to index not only for our own search would be a nice
> added benefit.

Sure, but experience has shown that the strategy of "dump everything
into one huge text" works very poorly in Wikidata. That's why we
implemented specialized search that knows about how the structured data
works. If the search sucks less now than it did before, that's the reason.

> How feasible is this? Do we already have one or multiple tasks for this
> on Phabricator? Phabricator has gotten a bit unclear when it comes to
> Wikidata search, I think because of misunderstanding between people what
> the goal of the task is. Might be worthwhile spending some time on
> structuring that.

Wikidata search tasks would be under "Wikidata" + "Discovery-Search".
There are multiple tasks for it, but if you want to add any, please feel
welcome to browse and add.

-- 
Stas Malyshev
smalys...@wikimedia.org



[Wikidata] Indexing everything (was Re: Indexing all item properties in ElasticSearch)

2018-08-04 Thread Maarten Dammers

Hi Stas and Hay,


On 28-07-18 02:12, Stas Malyshev wrote:

> Hi!
>
> > I could definitely see a usecase for 1) and maybe for 2). For example,
> > let's say i remember that one movie that Rutger Hauer played in, just
> > searching for 'movie rutger hauer' gives back nothing:
> >
> > https://www.wikidata.org/w/index.php?search=movie+rutger+hauer
> >
> > While Wikipedia gives back quite a nice list of options:
> >
> > https://en.wikipedia.org/w/index.php?search=movie+rutger+hauer
>
> Well, this is not going to change with the work we're discussing. The
> reason you don't get anything from Wikidata is because "movie" and
> "rutger hauer" are labels from different documents and ElasticSearch
> does not do joins. We only index each document in itself, and possibly
> some additional data, but indexing labels from other documents is now
> beyond what we're doing. We could certainly discuss it but that would be
> separate (and much bigger) discussion.

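
The "no joins" point can be illustrated with a toy per-document search.
This is not how ElasticSearch is implemented; the documents, the
placeholder ID Q_HAUER and the function search are assumptions chosen to
show why all query terms must occur in one document:

```python
# Toy index: each document holds only its own text plus raw IDs of the
# items it links to. Q_HAUER is a placeholder ID, not a real item.
DOCS = {
    "Q184843": "Blade Runner P31 Q11424 P161 Q_HAUER",
    "Q11424": "film movie",
    "Q_HAUER": "Rutger Hauer",
}


def search(query: str, docs: dict) -> list:
    """Return IDs of documents that contain every query term on their own."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        if all(term in tokens for term in terms):
            hits.append(doc_id)
    return hits


print(search("rutger hauer", DOCS))        # ['Q_HAUER']
print(search("movie rutger hauer", DOCS))  # []
```

"movie" lives in one document and "rutger hauer" in another, so no single
document matches the combined query.
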
Changing the topic because I would like to start this separate and 
bigger discussion. Query and search are quite similar, but also very 
different (if you search you'll run into nice articles like 
https://everypageispageone.com/2011/07/13/search-vs-query/ ). Currently 
our query service is a very strong and complete service, but Wikidata 
search is very poor. Let's take Blade Runner.

* https://www.wikidata.org/wiki/Q184843 is what a human sees
* http://www.wikidata.org/entity/Q184843.json our internal JSON structure
* http://www.wikidata.org/entity/Q184843.rdf source for the query engine
* https://www.wikidata.org/w/index.php?title=Q184843&action=cirrusdump
what's indexed in the search engine


In my ideal world, everything I see as a human gets indexed into the 
search engine preferably in a per language index. For example for Dutch 
something like a text_nl field with the label, description, aliases,
statements and references in there. So index *everything* and never see 
a Qnumber or Pnumber in there (extra incentive for people to add labels 
in their language). Probably also everything duplicated in the text 
field to fall back to. In this index you would have the "movie Rutger 
Hauer", you would have the cast members ("rolverdeling: Harrison Ford" 
etc.). Yes, this will give a significant increase of index size, but 
will make it much easier to actually find things.
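
A sketch of what building such a per-language field could look like,
assuming labels are resolved at indexing time. The entity structure, the
placeholder IDs Q_FORD and Q_HAUER, the sample description and the
function build_text_field are illustrative assumptions, not the actual
CirrusSearch document format:

```python
# Hypothetical label table: entity or property ID -> language code -> label.
LABELS = {
    "P161": {"nl": "rolverdeling"},
    "Q_FORD": {"nl": "Harrison Ford"},   # placeholder ID
    "Q_HAUER": {"nl": "Rutger Hauer"},   # placeholder ID
}

# Simplified stand-in for the Blade Runner item; the description is made up.
BLADE_RUNNER = {
    "labels": {"nl": "Blade Runner"},
    "descriptions": {"nl": "film"},
    "aliases": {"nl": []},
    "statements": {"P161": ["Q_FORD", "Q_HAUER"]},
}


def label_of(entity_id: str, lang: str) -> str:
    """Label in the given language, or the raw ID if no label exists."""
    return LABELS.get(entity_id, {}).get(lang, entity_id)


def build_text_field(entity: dict, lang: str) -> str:
    """Flatten label, description, aliases and statements into one text."""
    parts = []
    for key in ("labels", "descriptions"):
        value = entity.get(key, {}).get(lang)
        if value:
            parts.append(value)
    parts.extend(entity.get("aliases", {}).get(lang, []))
    for prop, values in entity.get("statements", {}).items():
        for value in values:
            parts.append(f"{label_of(prop, lang)}: {label_of(value, lang)}")
    return "\n".join(parts)


print(build_text_field(BLADE_RUNNER, "nl"))
```

The result would be stored as text_nl. Every line of it depends on labels
that belong to other items, which is where the index size and the update
cost come from.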


As for implementation: We already have the logic to serialize our json 
to the RDF format. Maybe also add a serialization format for this that 
is easy to ingest by search engines? I noticed Google having a hard time 
indexing some of our items; see for example
https://www.google.com/search?q=The+Feast+of+the+Seagods+site%3Awikidata.org
. DuckDuckGo seems to be doing a better job:
https://duckduckgo.com/?q=The+Feast+of+the+Seagods+site%3Awikidata.org
. Making it easier to index not only for our own search would be a nice
added benefit.


How feasible is this? Do we already have one or multiple tasks for this 
on Phabricator? Phabricator has gotten a bit unclear when it comes to 
Wikidata search, I think because of misunderstanding between people what 
the goal of the task is. Might be worthwhile spending some time on 
structuring that.


Maarten
