
Canonical RDF, Linked Data Signatures #116

Closed
iherman opened this issue Mar 16, 2018 · 20 comments


iherman commented Mar 16, 2018

This issue has been lingering in the RDF community for a long time; the problem was that there was no known algorithm to define a canonical RDF graph representation, a necessary step for signing RDF graphs.

Things changed a few years ago when, in conjunction with the work on JSON-LD and its applications, an algorithm for the creation of a canonical RDF graph was published. However, this work has never advanced to a WG.

Any usage of RDF (primarily in the form of JSON-LD) in the area of authentication, verifiable claims, etc., would rely on such an algorithm.

@iherman iherman added the Data label Mar 16, 2018
@iherman iherman self-assigned this Mar 16, 2018

iherman commented Mar 16, 2018

Comments of Digital Bazaar on the JSON-LD draft charter (which has explicitly declared these issues out of scope):

W3C Members can't keep kicking this particular can down the road, but I understand why it keeps happening (perceived lack of experts/momentum).


iherman commented Mar 24, 2018

Separate discussions with DB agreed that one way forward is to provide an independent, peer-reviewed mathematical proof for the proposed (and already deployed) algorithm for Canonical RDF. A W3C WG may indeed be ill equipped to provide that. If such a proof were publicly available and considered accepted by the community, turning it into practice could indeed be done in a Working Group...


iherman commented Apr 24, 2018

There was a discussion at the LDOW Workshop at the Web Conference WWW'2018 in Lyon, April 2018, which emphasized the importance of such a standard for the future of Linked Data.


iherman commented Apr 24, 2018

Two academic references from Aidan Hogan, who has published a canonicalization algorithm:

Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In the Proceedings of the 24th International World Wide Web Conference (WWW), Florence, Italy, May 18–22, 2015.
http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf

Aidan Hogan. "Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes". In ACM Transactions on the Web, 2017.
http://aidanhogan.com/docs/rdf-canonicalisation.pdf


iherman commented Apr 24, 2018

Incidentally, the issue came up in a paper at another workshop (Distributed Ledgers and Linked Data) which does include references to digest algorithms and reports on implementations:

M. Sopek, P. Gradzki, W. Kosowski, D. Kuziski, R. Trójczak, and R. Trypuz, “GraphChain: A Distributed Database with Explicit Semantics and Chained RDF Graphs,” in Companion of the The Web Conference 2018 on The Web Conference 2018, Republic and Canton of Geneva, Switzerland, 2018, pp. 1171–1178, https://dl.acm.org/authorize?N655224

The presentation slides are at: https://ml.ms/TheGraphChainLyon (though those go beyond the specific issue of canonicalization)


iherman commented Apr 25, 2018

For completeness: the CG draft on canonicalization:

http://json-ld.github.io/normalization/spec/index.html

With some issues listed at:

http://json-ld.github.io/normalization/issues


iherman commented Jun 8, 2018

I did an implementation of Aidan's algorithm (see #116 (comment)):

https://github.com/iherman/canonical_rdf

This means that Aidan's paper may also be considered a solid basis for a formal specification.

At this point, the question is whether there is sufficient interest in the community to move ahead with a WG.


iherman commented Nov 5, 2018

Notes of a discussion held at TPAC on a way forward:

https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html


iherman commented Jan 19, 2019

The latest discussions made it clear that the original deadlines (see the 5th bullet point in the TPAC discussion) cannot be met, for lack of time and money. We are investigating the way forward further by email.


iherman commented Sep 24, 2019

There has been a TPAC Breakout session that reinforced the current strategy steps:

  1. The algorithm implemented by Digital Bazaar gets a mathematical description that also gets a thorough mathematical review
  2. Once (1) is done, experts from Digital Bazaar and Aidan Hogan would "merge" the two contributions to come up with a single mathematical algorithm
  3. The merged algorithm would be the basis to start a WG. The charter would include mainly the canonicalization algorithm, but also a set of terms to describe Linked Data "proofs" in RDF.

Step (1) is actually on a good track now: there is a not-yet-public draft for the Digital Bazaar algorithm, which starts the review process.


mmccool commented Jan 16, 2021

Please let us know when/if we can expect this to be finalized. We would very much like to include signing in WoT TDs, but we also want to support discovery via directories that support SPARQL and JSON-LD "round-tripping". Our fallback plan is to require that directories maintain the original JSON serialization exactly and provide it upon request, using JCS (RFC 8785, the IETF canonical JSON serialization for signing), but this is kind of inelegant.
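As an aside, for simple JSON objects the JCS fallback mentioned above can be approximated with Python's standard library alone. This is a sketch, not a conformant RFC 8785 implementation: full JCS also mandates ECMAScript-style number formatting, which `json.dumps` does not reproduce for all floats.

```python
import json

def canonical_json(obj) -> str:
    # Approximates JCS (RFC 8785) for objects with string/int members:
    # lexicographically sorted keys, no insignificant whitespace, raw UTF-8.
    # (Full JCS additionally requires ECMAScript number serialization.)
    return json.dumps(obj, separators=(",", ":"), sort_keys=True, ensure_ascii=False)

# Member order in the input no longer matters after canonicalization,
# which is what makes the serialization usable for signing:
a = {"b": 2, "a": 1}
b = {"a": 1, "b": 2}
assert canonical_json(a) == canonical_json(b) == '{"a":1,"b":2}'
```

The point of the fallback is exactly this property: two parties who hold semantically equal JSON can compute byte-identical input for a signature.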


iherman commented Jan 17, 2021

I would hope to start the chartering procedure in the first quarter of this year. It took longer than expected to get to this point.

Cc @msporny @dlongley @pchampin


msporny commented Jan 17, 2021

> Separate discussions with DB agreed that one way forward is to provide an independent (peer reviewed) mathematical proof for the proposed (and also deployed) algorithm for Canonical RDF. A W3C WG may indeed be ill equipped to provide that. If such a proof was publicly available and considered as accepted by the community, turning it into practice could indeed be done in a Working Group.

Just to report in on progress here -- we now have (as of mid-2020) multiple independent reviews of the mathematical proofs for the RDF Canonicalization algorithm from "real mathematicians". We need to ask the reviewers to make those documents public so they can be taken as input documents into the work.

Many of the people using the Linked Data Security work (building implementations, using it in pilot/production systems) are currently embroiled in getting the Decentralized Identifier specification to CR/PR -- it is the very next thing many of us want to do after DID Core goes to CR/PR, and we expect that to happen in early to mid 2021. So, doing a "charter in progress" announcement for the "Linked Data Security Working Group" seems like a good idea around March 2021 or so.

The final thing we were waiting for is for multiple WGs to note that they want to see this work done, so we get the W3C Member votes needed. We expect that the VCWG, DIDWG, CCG, and now hopefully the WoTWG members would support the work.

It feels like all of this is converging, and we should be able to make progress on this WG this year, since it seems to be becoming a critical-path item for multiple WGs.

@dbooth-boston

@msporny, was the canonicalization algorithm upgraded (based on Aidan Hogan's work) to address the diff use case in addition to the digital signatures use case? Addressing the diff use case means that a small change to the source graph is likely to yield a small change in its canonicalization. I ask because the diff use case was not addressed in the original JSON-LD canonicalization algorithm.


msporny commented Mar 4, 2021

> @msporny, was the canonicalization algorithm upgraded (based on Aidan Hogan's work) to address the diff use case in addition to the digital signatures use case? Addressing the diff use case means that a small change to the source graph is likely to yield a small change in its canonicalization. I ask because the diff use case was not addressed in the original JSON-LD canonicalization algorithm.

As far as I know, there have been no production-level implementations of Aidan's work wrt. diff'ing, and so it's been difficult to test the "likely" part of your statement. We'd need more work in that area to see if it's a workable solution.

A large number of the organizations using the Canonical RDF work just want us to start by standardizing the 2015 normalization algorithm (URDNA2015), which is already out in the field. Once that's done, we can work on improving future c14n algorithms. At least, that would be our preference.

With respect to diff'ing -- the 2015 algorithm can be used to do that (in fact, we do that when our digital signatures fail to work to see what is different in the graph that's being digitally signed and the graph that is being verified). It's usually fairly easy to see where things went wrong. Is that the use case you're wondering about, or were you hoping for something else/more?


iherman commented Mar 4, 2021

> With respect to diff'ing -- the 2015 algorithm can be used to do that (in fact, we do that when our digital signatures fail to work to see what is different in the graph that's being digitally signed and the graph that is being verified). It's usually fairly easy to see where things went wrong. Is that the use case you're wondering about, or were you hoping for something else/more?

@msporny, I would need some technical explanation on this. As @dbooth-boston said, my feeling about the canonicalization algorithm (whether yours or Aidan's) is that even a small change to the graph may lead to big changes, because the blank node identification may change radically, resulting in very different canonical forms. Or do you replace things with some form of similarity measure when relabeling the blank nodes and calculating the final hash of the graph?

Diffing may be a major use case for this (just had some discussion about this with @danbri) so I would be happy to be proven wrong.
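The instability being described here can be shown with a deliberately simplified toy labeling scheme (this is NOT URDNA2015; `toy_labels` and the `ex:` triples are invented for illustration): label each blank node by hashing the triples it appears in as subject. Even this stripped-down version shows why a one-literal edit can change a blank node's canonical label, and in real canonicalization algorithms hashes propagate along blank-node connections, so the effect spreads further.

```python
import hashlib

def toy_labels(triples):
    """Toy canonical labeling: hash each blank node's subject triples.

    Hypothetical sketch only -- URDNA2015 is considerably more involved
    (first-degree and n-degree hashing over connected blank nodes).
    """
    labels = {}
    bnodes = {t[0] for t in triples if t[0].startswith("_:")}
    for b in sorted(bnodes):
        sig = "".join(sorted("|".join(t) for t in triples if t[0] == b))
        labels[b] = "_:c14n" + hashlib.sha256(sig.encode()).hexdigest()[:8]
    return labels

g1 = [("_:b0", "ex:name", '"Alice"'),
      ("_:b0", "ex:knows", "_:b1"),
      ("_:b1", "ex:name", '"Bob"')]
# Same graph, except one literal changed:
g2 = [("_:b0", "ex:name", '"Alicia"'),
      ("_:b0", "ex:knows", "_:b1"),
      ("_:b1", "ex:name", '"Bob"')]

# The edited literal changes _:b0's signature, hence its canonical label:
assert toy_labels(g1)["_:b0"] != toy_labels(g2)["_:b0"]
# In this toy scheme _:b1's own triples are unchanged, so its label survives;
# an algorithm that hashes across bnode links would relabel it too.
assert toy_labels(g1)["_:b1"] == toy_labels(g2)["_:b1"]
```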


msporny commented Mar 4, 2021

> @msporny, I would need some technical explanation on this.

To debug a diff of a graph, we take the c14n'd N-Quads output of the signing software. Then we take the c14n'd N-Quads output of the verifying software, and we literally run diff on the two sets of N-Quads. That usually highlights the changes between the two, and it's typically compact and easy to read... usually because the reason things don't line up isn't bnodes or graph structure, but the literals in the graph.
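The debugging workflow described above can be sketched with the standard library. The two N-Quads strings below are hardcoded stand-ins for the sorted canonical output of a URDNA2015 implementation on the signer's and the verifier's side (an assumption for illustration; any conformant canonicalizer would produce the actual inputs).

```python
import difflib

# Stand-in for the canonical N-Quads emitted by the signing software:
signer_nquads = """\
_:c14n0 <http://schema.org/jobTitle> "Engineer" .
_:c14n0 <http://schema.org/name> "Alice" .
"""
# Stand-in for the canonical N-Quads emitted by the verifying software,
# where one literal differs (the typical failure mode described above):
verifier_nquads = """\
_:c14n0 <http://schema.org/jobTitle> "Professor" .
_:c14n0 <http://schema.org/name> "Alice" .
"""

diff = difflib.unified_diff(signer_nquads.splitlines(),
                            verifier_nquads.splitlines(), lineterm="")
# Keep only the changed quads, dropping the --- / +++ file headers:
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
print("\n".join(changed))
```

Because canonicalization fixes both the blank-node labels and the line order, the remaining diff lines point directly at the offending literal.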

> Or do you replace things with some form of similarity measure when relabeling the blank nodes and calculating the final hash of the graph?

This is not done.

> even a small change on the graph will lead to possibly big changes, because the blank node identification may radically change resulting in very different canonical forms.

Yes, but we haven't heard of this creating a problem for the people who have deployed URDNA2015 for years. We'd need to evaluate the use case to see why one would need to minimize the change in the canonical form between successive versions of the same graph.

> Diffing may be a major use case for this (just had some discussion about this with @danbri) so I would be happy to be proven wrong.

My concern is that there are a growing number of companies that are counting on URDNA2015 being standardized somewhere, people have built software around it, there are proofs on it, and it's been deployed in systems. Even hinting that the WG is going to change how URDNA2015 works (unless there is a critical flaw) is going to create a non-trivial number of upset companies... to the point where I think there will be objections on the charter itself.

Now, to be clear -- I think it's just a matter of priorities. If folks want to innovate on what's already there by changing the canonical form, that's fine, even if they do it in the WG (call it URDNA2021)... but URDNA2015 is a high priority for a number of companies... there is a lot of software already written around it, and I expect that the chartering will fail if we don't specifically call out that a focus will be URDNA2015.

We can do both... URDNA2015 and URDNA2021... I know there is support for URDNA2015... I don't know the use case for URDNA2021, there are no implementations, and I don't know the status of the formal proofs.


iherman commented Mar 5, 2021

> To debug a diff of a graph, we take the c14n'd NQuads output of the signing software. Then we take the c14n'd NQuads output of the verifying software, and we literally run diff on the two sets of NQuads. That usually highlights the changes between the two and it's typically compact and easy to read... usually because the reasons things don't line up isn't because of bnodes or graph structure, but due to literals in the graph.

I do not really understand what the role of c14n is in this respect. What c14n does is re-label, in a canonical manner, the blank nodes, but the re-labeling is different if the two graphs are different. I.e., if the only thing you do is follow the same procedure for converting your graph into N-Quads, without c14n involved, the diff would work the same way...

But we are digressing I guess...

> My concern is that there are a growing number of companies that are counting on URDNA2015 being standardized somewhere, people have built software around it, there are proofs on it, and it's been deployed in systems. Even hinting that the WG is going to change how URDNA2015 works (unless there is a critical flaw) is going to create a non-trivial number of upset companies... to the point where I think there will be objections on the charter itself.
>
> Now, to be clear -- I think it's just a matter of priorities. If folks want to innovate on what's already there by changing the canonical form, that's fine, even if they do it in the WG (call it URDNA2021)... but URDNA2015 is a high priority for a number of companies... there is a lot of software already written around it, and I expect that the chartering will fail if we don't specifically call out that a focus will be URDNA2015.
>
> We can do both... URDNA2015 and URDNA2021... I know there is support for URDNA2015... I don't know the use case for URDNA2021, there are no implementations, and I don't know the status of the formal proofs.

Let us discuss this when we get to the specificities of the WG charter...


iherman commented Mar 30, 2021

Official strategy review started #262


wseltzer commented May 7, 2021

Closing this issue in favor of #262

@wseltzer wseltzer closed this as completed May 7, 2021