Wikidata talk:SPARQL query service/WDQS graph split/Rules

Multiple Instances of

"Entities having multiple instance of (P31) defined may pose some challenges. Such cases might be considered as data quality issues where the solution should be to disambiguate this entity by creating a separate entity."

I know of items which are - validly, I would argue - both "scholarly article" and "obituary". Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:35, 10 September 2024 (UTC)

@Andy Mabbett Thanks. I think this demonstrates the limitations of using P31 to identify the type of publication. You might be interested in the discussions about re-evaluating this approach using another property; please see Wikidata_talk:WikiCite#Community_input_into_WDQS_graph_split:_a_publication_type_property_proposal. DCausse (WMF) (talk) 16:38, 10 September 2024 (UTC)
I checked with QLever how many scholarly articles we have with multiple P31 values, and I found a bit over 3 million. Samoasambia 18:11, 10 September 2024 (UTC)
And slightly more when you include not just "scholarly article" but all the subclasses of it. Samoasambia 18:17, 10 September 2024 (UTC)
Thanks for the query. Perhaps I should reformulate this statement to say that some of these instances may pose challenges and could be considered data quality issues if the item conflates a scientific publication with something else. Looking at a few examples:
DCausse (WMF) (talk) 19:26, 10 September 2024 (UTC)
If this is of any interest, here is the list of types used as P31 alongside scholarly articles, which I compiled to illustrate the discussions we had around this concern at Wikidata_talk:SPARQL_query_service/WDQS_graph_split/WDQS_Split_Refinement#Clinical_trials. DCausse (WMF) (talk) 07:34, 11 September 2024 (UTC)
Yes, that is interesting. Was it produced by a query, and if so, could you please share it? Some of the combinations are clearly in error and need to be fixed. Some suggest a new item might be needed. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:34, 11 September 2024 (UTC)
I compiled this list from the dumps using the WMF compute cluster. I can re-extract the list in a couple of months if you'd like; please don't hesitate to ping me if/when you need this. WDQS will probably time out, but QLever is apparently able to extract some of this data: https://qlever.cs.uni-freiburg.de/wikidata/LtDqyR. DCausse (WMF) (talk) 09:21, 12 September 2024 (UTC)
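
For anyone wanting to reproduce a count like the one mentioned earlier in this thread, the following Python sketch builds a SPARQL query that counts items of a class having at least one other P31 value. It is not the query that was actually run; the function name and parameters are hypothetical, and whether the query completes on a given endpoint is not guaranteed.

```python
def multi_p31_count_query(class_qid="Q13442814", include_subclasses=False):
    """Build a SPARQL query counting items of a class that have another P31 value.

    Hypothetical helper for illustration; Q13442814 is scholarly article.
    """
    if include_subclasses:
        # Match the class itself or anything below it in the P279 tree.
        class_pattern = f"?cls wdt:P279* wd:{class_qid} ."
    else:
        # Match only the class itself.
        class_pattern = f"VALUES ?cls {{ wd:{class_qid} }}"
    return f"""PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {{
  {class_pattern}
  ?item wdt:P31 ?cls ;
        wdt:P31 ?other .
  FILTER(?other != ?cls)
}}"""


if __name__ == "__main__":
    print(multi_p31_count_query(include_subclasses=True))
```

The resulting string can be pasted into a SPARQL endpoint's query editor. Setting `include_subclasses=True` corresponds to the second count above, which also covers subclasses of "scholarly article".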

Books

The criteria as given do not seem to differentiate scholarly books from other books.

Can a (non-scholarly) book include (some or all) chapters which are scholarly? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:36, 10 September 2024 (UTC)

By scholarly books, do you mean publications that are instances of edited volume (Q1711593)?
I'm not sure why this type was not included. The split initially started by identifying only scholarly article (Q13442814), but we then extended the list by manually picking some types from this spreadsheet. Do you think edited volume (Q1711593) should be considered a scientific publication and served from the scholarly_articles subgraph endpoint?
Note that these rules do not restrict in any way how the data can be shaped in Wikidata; they only dictate, on a per-entity basis, from where the entity is going to be served by the Wikidata Query Service. DCausse (WMF) (talk) 17:51, 10 September 2024 (UTC)
edited volume (Q1711593) certainly seems to fit what I had in mind. I do think that if the decision as to which graph a class falls under can be determined unambiguously by tracing it back to a specific parent class ("shaped in Wikidata", as you put it) - a particular branch of the tree, in other words - then there will be less pain in future than by matching it to what might be considered an arbitrary and ambiguous list. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:50, 10 September 2024 (UTC)

Parent class

Surely all the classes listed under Wikidata:SPARQL query service/WDQS graph split/Rules#Scholarly Articles should be subclass of [thing], where nothing else is a subclass (or instance) of [thing]? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:38, 10 September 2024 (UTC)

I'm not sure I understand your point; could you elaborate a bit more? DCausse (WMF) (talk) 17:52, 10 September 2024 (UTC)
The classes listed in the subsection to which I linked are defined there as being "scholarly articles". Clearly this is not the same as scholarly article (Q13442814). But if they are a set of classes with common features, there should be a parent class which encompasses all of them, and which excludes any classes which lack those features. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:40, 10 September 2024 (UTC)
I think I agree with you, and I believe a more comprehensive way to define these rules would be through another property, as suggested in Wikidata_talk:WikiCite#Community_input_into_WDQS_graph_split:_a_publication_type_property_proposal. DCausse (WMF) (talk) 19:57, 10 September 2024 (UTC)

Comments

The list at Wikidata:SPARQL query service/WDQS graph split/Rules#Scholarly Articles includes comment (Q58897583).

  1. That is said to have equivalent class (P1709) https://schema.org/Comment. Schema.org defines this as "A comment on an item - for example, a comment on a blog post.".
  2. blog comment (Q84572095) is a subclass of comment (Q58897583).

Are either of these statements a cause for concern?

Can a subclass of a scholarly item be not a scholarly item? Are there other such cases? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:50, 10 September 2024 (UTC)

I'm not knowledgeable enough here to judge. Looking at the data, the vast majority of comment (Q58897583) items are also scholarly articles (74 out of the 80 comments in Wikidata). The 6 remaining comments are listed here: https://w.wiki/B9j9.
Does this mean that we have to take comment (Q58897583) out of the list here? I think this is an open question.
Regarding blog comment being a subclass of comment: this is technically not a problem for the split, since it does not take the class hierarchy into account. From a usability point of view, I agree that this might be misleading. DCausse (WMF) (talk) 16:55, 10 September 2024 (UTC)
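
To illustrate the point about the class hierarchy, here is a minimal Python sketch; it is not the actual split implementation. It assumes a two-item excerpt of the type list, a single subclass edge (blog comment to comment) taken from the statement quoted above, and hypothetical function names. The subgraph label "main" is also an assumption, while "scholarly_articles" follows the name used elsewhere on this page.

```python
# Illustrative excerpt only; the real list lives on the Rules page.
SCHOLARLY_TYPES = {"Q13442814", "Q58897583"}  # scholarly article, comment

# Subclass-of (P279) edges used for this example: child -> parents.
SUBCLASS_OF = {"Q84572095": {"Q58897583"}}  # blog comment -> comment


def route_direct(p31_values):
    """Route using only the entity's direct P31 values (no hierarchy)."""
    if SCHOLARLY_TYPES & set(p31_values):
        return "scholarly_articles"
    return "main"


def ancestors(cls):
    """Return cls plus every class reachable through SUBCLASS_OF."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def route_with_hierarchy(p31_values):
    """Hypothetical alternative that also follows subclass-of edges."""
    for cls in p31_values:
        if SCHOLARLY_TYPES & ancestors(cls):
            return "scholarly_articles"
    return "main"
```

Under these assumptions, an item whose only P31 value is blog comment (Q84572095) stays in the main graph with `route_direct`, but would move to the scholarly subgraph with `route_with_hierarchy`. That difference is why the subclass statement does not affect the split as described, even if it may look misleading.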

Publication type of scholarly article

Please see Wikidata:Property proposal/publication type of scholarly article, which I hope might address some of the confusion with these rules. DCausse (WMF) (talk) 13:30, 12 September 2024 (UTC)