
Parsoid/OutputTransform/HtmlHolder: Difference between revisions

From mediawiki.org
→‎Design decisions: Checked the parsing model for <template>; a note about JsonCodecs.
m typo
Line 4: Line 4:
The MediaWiki DOM Spec contains a large number of "JSON-valued" attributes, to express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the <code>DOMDataUtils</code> class, based on the <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Utils_1_1DOMDataUtils.html#a0b748ea183981976970a573efa780595|DOMDataUtils::getJSONAttribute()]]</code> method which returns a structured value: in the PHP implementation an associative array; formerly in the JavaScript implementation a JS [https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object Object]. This value is nominally present in "plain old HTML" or via [https://developer.mozilla.org/en-US/docs/Web/API/Element/getAttribute Element::getAttribute()] as the JSON-encoded value of the array/object, but is stored "live": that is, it is not parsed and re-serialized to a string attribute value every time the attribute is read or modified, but instead is kept as a live array or object value attached to the DOM Node. References can be kept to the live value, and it can be mutated and that change is immediately visible to anyone else which has a reference to the value.


The actual implementation is a bit baroque -- in addition to a [[Parsoid/MediaWiki DOM spec/Rich Attributes|proposal ("Rich attributes") to extend this basic mechanism to include DOM DocumentFragment values]], there are multiple different serialization formats for these structured values. The nominal "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. This is mitigated by the use of single-quotes around the attribute value in a minor departure from [https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes standard HTML serialization], but if the structured value contains HTML markup escaping becomes inevitable as (a) both available quotation marks have been used, and (b) <code><</code> and <code>&</code> are additionally required to be escaped. There's a separate but orthogonal issue with the exposure of "private" attributes in this naive serialization, [[Parsoid/OutputTransform/HtmlHolder#Private attributes|discussed below]]. For these two reasons, Parsoid has historically supported two additional alternative encodings of structured attributes. By adding a unique [https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/id id attribute] to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is called a [[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Core_1_1PageBundle.html|PageBundle]] (page bundles have further uses [[Parsoid/OutputTransform/HtmlHolder#Enumeration of fragments and metadata|described below]]). 
In another representation the combination is kept as a single HTML document, but the JSON-encoded map is stored in a <code><script></code> element in the <code><head></code> of the Document. This reduces the bloat caused by encoding all the quotation marks in the structured attributes, but adds additional bandwidth to record ID attributes on every node and additionally to include those ID values in the key portion of the map in the <code><head></code>.
The actual implementation is a bit baroque -- in addition to a [[Parsoid/MediaWiki DOM spec/Rich Attributes|proposal ("Rich attributes") to extend this basic mechanism to include DOM DocumentFragment values]], there are multiple different serialization formats for these structured values. The nominal "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. This is mitigated by the use of single-quotes around the attribute value in a minor departure from [https://w3c.github.io/DOM-Parsing/#serializing-an-element-s-attributes standard HTML serialization], but if the structured value contains HTML markup escaping becomes inevitable as (a) both available quotation marks have been used, and (b) <code><</code> and <code>&</code> are additionally required to be escaped. There's a separate but orthogonal issue with the exposure of "private" attributes in this naive serialization, [[Parsoid/OutputTransform/HtmlHolder#Private attributes|discussed below]]. For these two reasons, Parsoid has historically supported two additional alternative encodings of structured attributes. By adding a unique [https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/id id attribute] to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is called a [[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Core_1_1PageBundle.html|PageBundle]] (page bundles have further uses [[Parsoid/OutputTransform/HtmlHolder#Enumeration of fragments and metadata|described below]]). 
In another representation the combination is kept as a single HTML document, but the JSON-encoded map is stored in a script element in the head of the Document. This reduces the bloat caused by encoding all the quotation marks in the structured attributes, but adds additional bandwidth to record ID attributes on every node and additionally to include those ID values in the key portion of the map in the head.


This id-to-attribute map is also used internally to the implementation: instead of hanging the rich attribute values directly off of the DOM Node, in the PHP implementation the ID-to-value map is stored in a <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1NodeData_1_1DataBag.html|DataBag]]</code> which is attached to the root <code>Document</code> object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the <code>Node</code> implemented by the <code>[https://github.com/GNOME/libxml2 libxml2]</code> library. Those ephemeral PHP wrapper objects are created and destroyed every time a reference to the Node goes into or out of scope in PHP. When the ephemeral PHP wrapper goes out of scope, any data attached to the Node is destroyed, even if a reference to the Node is still present in the native document model. By keeping a persistent reference to the (wrapper of the) main <code>Document</code> object in Parsoid's <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Config_1_1Env.html|Env]]</code> class, which is kept alive for the duration of the parse, we can prevent the <code>DataBag</code> from being destroyed. (We could also just keep an explicit reference to the <code>DataBag</code> in the <code>Env</code> which would avoid the use of dynamic properties in PHP.)
Line 15: Line 15:
Special treatment was also extended to <code>data-mw</code> attributes, which were used by convention to store information "needed by editing clients but not for readers". The idea was that <code>data-mw</code> attributes would also be stripped in content served for read views or for reader clients to save additional bandwidth.


In this context, an additional benefit to storing the structured attributes outside the <code>Document</code> (or in a separate element in the <code><head></code>) was that it allowed API code to efficiently implement these attribute-stripping strategies without requiring node-by-node traversal. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, used the contents of <code>data-mw</code> attributes, requiring the <code>data-mw</code> attributes to be explicitly reloaded from the separate storage before the HTML was usable.
In this context, an additional benefit to storing the structured attributes outside the <code>Document</code> (or in a separate element in the head) was that it allowed API code to efficiently implement these attribute-stripping strategies without requiring node-by-node traversal. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, used the contents of <code>data-mw</code> attributes, requiring the <code>data-mw</code> attributes to be explicitly reloaded from the separate storage before the HTML was usable.


Since the implementation of separate storage was tied to abstraction boundary design goals for <code>data-parsoid</code> and <code>data-mw</code> specifically, the <code>DataBag</code> and load/store mechanism was initially implemented only for structured values of these two attributes. Other structured value attributes used the uncached storage mechanism of <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Utils_1_1DOMDataUtils.html#a0b748ea183981976970a573efa780595|DOMDataUtils::getJSONAttribute]]()</code> and the values returned were not live but had to be explicitly saved with <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Utils_1_1DOMDataUtils.html#a21031987e192ab8dc6a56ed11e0a59b1|DOMDataUtils::setJSONAttribute()]]</code>. The [[Parsoid/MediaWiki DOM spec/Rich Attributes|Rich Attribute]] proposal would extend live storage to all structured-value attributes and separate out the policy decision regarding the precise set of structured attributes to be encoded in separate storage (as opposed to inline).
Line 22: Line 22:
For an <code>HtmlHolder</code> interface in core, two views of the document are provided: an HTML string and a DOM object model.


We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to <code>DOMDataUtils::getJSONAttribute()</code> will be provided in core (or more likely, in an HTML library which may also contain parts of Parsoid's <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Utils_1_1DOMCompat.html|DOMCompat]]</code> library) which will work on the DOM as returned by <code>HtmlHolder</code>. This is consistent with either an eager "load" step occuring after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.
We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to <code>DOMDataUtils::getJSONAttribute()</code> will be provided in core (or more likely, in an HTML library which may also contain parts of Parsoid's <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Utils_1_1DOMCompat.html|DOMCompat]]</code> library) which will work on the DOM as returned by <code>HtmlHolder</code>. This is consistent with either an eager "load" step after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.


The HTML string provided by <code>HtmlHolder</code> will be the "naive" inline-attribute serialization of the document, not one of the alternate encodings. When converting from DOM to a string, an appropriate "store" step will be performed to serialize the current live values of structured attributes. Private attributes like <code>data-parsoid</code> will <u>not</u> be stripped from the HTML string. <code>HtmlHolder</code> will therefore need to know about "structured value" HTML (again, as an abstraction provided by the HTML library used), but will not need to specially handle <code>data-parsoid</code> or any other Parsoid-internal attributes.
Line 50: Line 50:
The <code>PageBundle</code> data structure will be removed from Parsoid and moved from core's <code>MediaWiki\Parser</code> namespace into the REST API implementation, as a feature of the REST interface design but not a core Parsoid abstraction. The metadata written by Parsoid which is not already reflected by appropriate <code>ParserOutput</code> properties, such as specific content headers needed by the REST API, will be written by Parsoid directly to <code>ParserOutput</code> extension data using Parsoid's ContentMetadataCollection interface, either using the existing <code>parsoid-page-bundle</code> key or new keys specific to the particular metadata. The main Parsoid entrypoints will use <code>HtmlHolder</code>+<code>ContentMetadataCollector</code> as result types rather than page bundle; this will also avoid a serialization step and allow Parsoid to return its DOM result (with live structured attributes) directly to core. The <code>[[wmdoc:Parsoid-PHP/master/classWikimedia_1_1Parsoid_1_1Wt2Html_1_1XMLSerializer.html|XmlSerializer]]</code> code can also be removed from Parsoid, since Parsoid's APIs will now be DOM-based. <code>HtmlHolder</code> plus support code for structured attributes will likely be moved to a library, which is probably also a good home for serialization code like <code>XmlSerializer</code>.


''(Tentatively:)'' An API will be provided to store and fetch <code>[https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment DocumentFragments]</code> by ID from <code>[https://developer.mozilla.org/en-US/docs/Web/HTML/Element/template <template>]</code> elements in the Document <code><head></code>. "Child" <code>HtmlHolder</code> instances will serialize themselves as simply the appropriate ID key, and they will fetch the appropriate <code>DocumentFragment</code> from the parent based on ID when necessary. This will allow storage of <code>DocumentFragment</code>s (held by the <code>HtmlHolder</code>) in extension data or in <code>ParserOutput</code> fields. Live manipulation of structured data contained within these fragments will then be appropriately loaded and stored by the parent <code>Document</code> (held by its own <code>HtmlHolder</code>). Since these child fragments are part of the main document tree, they can be enumerated and mutated in-place by post-processing passes without explicit knowledge, and structured data attributes within the fragment will be transparently held by the <code>DataBag</code> or other mechanism used by the parent. Inside the HTML library, an API will allow easy creation of a new empty <code>DocumentFragment</code>/<code>HtmlHolder</code> tied to the owner document, as well as (for legacy compatibility) creating a new <code>DocumentFragment</code> /<code>HtmlHolder</code> tied to the owner document from an HTML string. Enumerating all fragments for post-processing can be done with <code>Document::[https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll querySelectorAll]('body, head > template')</code>; this can also be exposed as an API helper method.
''(Tentatively:)'' An API will be provided to store and fetch <code>[https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment DocumentFragments]</code> by ID from [https://developer.mozilla.org/en-US/docs/Web/HTML/Element/template template] elements in the Document head. "Child" <code>HtmlHolder</code> instances will serialize themselves as simply the appropriate ID key, and they will fetch the appropriate <code>DocumentFragment</code> from the parent based on ID when necessary. This will allow storage of <code>DocumentFragment</code>s (held by the <code>HtmlHolder</code>) in extension data or in <code>ParserOutput</code> fields. Live manipulation of structured data contained within these fragments will then be appropriately loaded and stored by the parent <code>Document</code> (held by its own <code>HtmlHolder</code>). Since these child fragments are part of the main document tree, they can be enumerated and mutated in-place by post-processing passes without explicit knowledge, and structured data attributes within the fragment will be transparently held by the <code>DataBag</code> or other mechanism used by the parent. Inside the HTML library, an API will allow easy creation of a new empty <code>DocumentFragment</code>/<code>HtmlHolder</code> tied to the owner document, as well as (for legacy compatibility) creating a new <code>DocumentFragment</code> /<code>HtmlHolder</code> tied to the owner document from an HTML string. Enumerating all fragments for post-processing can be done with <code>Document::[https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll querySelectorAll]('body, head > template')</code>; this can also be exposed as an API helper method.


The JSON codec for child holders will need to use the codec context to ensure that child <code>HtmlHolder</code> objects are properly relinked to the parent on deserialization. The [https://wikimedia.slack.com/archives/C024Z8K9CAU/p1676580615953829 Slack discussion] on [[phab:T346829|T346829]] seemed to get hung up on whether this sort of stateful deserializer should be "discouraged but possible" or whether the JSON codec should explicitly prohibit anything but value objects. If serialization is restricted to simple value objects, then the <code>HtmlHolder::getHtml()</code> and <code>HtmlHolder::getDom()</code> methods need to include a parent object (ParserOutput, parent HtmlHolder, etc.) as an explicit parameter (or as an explicit parameter of a similar-but-not-identical <code>ChildHtmlHolder</code> class) so that the child holder can be relinked to the parent Document after deserialization. (Perhaps even <code>ParserOutput::getRawText()</code> could return a "child" <code>HtmlHolder</code>, with the contents of the <code><body></code> tag as a special case and the full parent document stored elsewhere. This would make all <code>HtmlHolder</code>s "children" with references to a special <code>ParentHtmlHolder</code>; i.e. the parent is the exception, not the child.)
The JSON codec for child holders will need to use the codec context to ensure that child <code>HtmlHolder</code> objects are properly relinked to the parent on deserialization. The [https://wikimedia.slack.com/archives/C024Z8K9CAU/p1676580615953829 Slack discussion] on [[phab:T346829|T346829]] seemed to get hung up on whether this sort of stateful deserializer should be "discouraged but possible" or whether the JSON codec should explicitly prohibit anything but value objects. If serialization is restricted to simple value objects, then the <code>HtmlHolder::getHtml()</code> and <code>HtmlHolder::getDom()</code> methods need to include a parent object (ParserOutput, parent HtmlHolder, etc.) as an explicit parameter (or as an explicit parameter of a similar-but-not-identical <code>ChildHtmlHolder</code> class) so that the child holder can be relinked to the parent Document after deserialization. (Perhaps even <code>ParserOutput::getRawText()</code> could return a "child" <code>HtmlHolder</code>, with the contents of the body tag as a special case and the full parent document stored elsewhere. This would make all <code>HtmlHolder</code>s "children" with references to a special <code>ParentHtmlHolder</code>; i.e. the parent is the exception, not the child.)
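
The "value object plus explicit parent" option described above can be sketched in a few lines. This is a language-neutral illustration in Python, not code from Parsoid or MediaWiki core: the class names <code>ParentHolder</code> and <code>ChildHolder</code>, their methods, and the <code>frag0</code>-style IDs are all hypothetical, and a plain dictionary stands in for the template elements in the document head.

```python
import json


class ParentHolder:
    """Owns the document; fragments are kept by ID.

    The dict stands in for <template> elements in the document <head>.
    """

    def __init__(self):
        self.fragments = {}
        self._next = 0

    def new_fragment(self, html):
        """Store a fragment and return a child holder that refers to it by ID."""
        frag_id = f"frag{self._next}"
        self._next += 1
        self.fragments[frag_id] = html
        return ChildHolder(frag_id, self)


class ChildHolder:
    """Serializes as just an ID; needs a parent to resolve its content."""

    def __init__(self, frag_id, parent=None):
        self.frag_id = frag_id
        self.parent = parent

    def to_json(self):
        # Only the ID is written out: the holder is a simple value object.
        return json.dumps({"id": self.frag_id})

    @classmethod
    def from_json(cls, text, parent):
        # The parent is passed explicitly, so no codec context is needed.
        return cls(json.loads(text)["id"], parent)

    def get_html(self):
        if self.parent is None:
            raise ValueError("child holder is not linked to a parent")
        return self.parent.fragments[self.frag_id]


parent = ParentHolder()
child = parent.new_fragment("<p>hi</p>")
restored = ChildHolder.from_json(child.to_json(), parent)
```

Because the restored child only holds an ID, any later change the parent makes to the fragment is what the child sees; nothing is copied into the child on deserialization.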


Note that the initial steps in the [[Parsoid/MediaWiki DOM spec/Rich Attributes|Rich Attributes]] proposal (before proposal 3) require the caller to provide the context type for the deserializer. This is a different design than the JsonCodec used in core, which follows the more typical design where the serialized object contains its own type marker.
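
The difference between the two deserializer designs can be illustrated with a small Python sketch. It is not the Rich Attributes or core JsonCodec implementation: the <code>Caption</code> class, the <code>REGISTRY</code> table, the function names, and the <code>_type_</code> key are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Caption:
    text: str
    lang: str


# Hypothetical registry mapping type markers to classes.
REGISTRY = {"Caption": Caption}


def encode_plain(obj):
    """Design 1: no type information is written to the JSON."""
    return json.dumps(asdict(obj))


def decode_with_hint(text, cls):
    """Design 1: the caller supplies the expected type as context."""
    return cls(**json.loads(text))


def encode_with_marker(obj):
    """Design 2: the serialized object carries its own type marker."""
    data = asdict(obj)
    data["_type_"] = type(obj).__name__
    return json.dumps(data)


def decode_with_marker(text):
    """Design 2: the type is recovered from the marker, not from the caller."""
    data = json.loads(text)
    cls = REGISTRY[data.pop("_type_")]
    return cls(**data)


caption = Caption("hello", "en")
via_hint = decode_with_hint(encode_plain(caption), Caption)
via_marker = decode_with_marker(encode_with_marker(caption))
```

The hint-based form produces smaller output but only works where the reader already knows what type to expect at each position; the marker-based form is self-describing at the cost of an extra key per object.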


I believe that the [https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intempla parsing model for <code><template></code>] inside <code><head></code> allows serialization of fragments in an appropriate way. <code><meta></code> and <code><link></code> tags are processed using the "[https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead in head]" insertion mode, but this seems to match how they are processed "[https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody in body]". If there is some issue it may be necessary to add another wrapper element inside the <code><template></code> (like a <code><body></code> tag) to reset the parsing mode so we're not "in template" and <code><meta></code> etc tags are parsed properly.
I believe that the [https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intempla parsing model for template] inside head allows serialization of fragments in an appropriate way. meta and link tags are processed using the "[https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhead in head]" insertion mode, but this seems to match how they are processed "[https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody in body]". If there is some issue it may be necessary to add another wrapper element inside the template (like a body tag) to reset the parsing mode so we're not "in template" and meta etc tags are parsed properly.

Revision as of 08:04, 26 November 2023

When Parsoid (and Parsoid-aware transforms) hold a DOM object model, there are two important features/extensions which an HtmlHolder interface in core needs to be aware of. The first is structured-value and private attributes within the Document, and the other is the representation of standalone document fragments. This document describes these features and presents consequent design decisions relating to the HtmlHolder abstraction.

Structured-value attributes and the DataBag

The MediaWiki DOM Spec contains a large number of "JSON-valued" attributes, to express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the DOMDataUtils class, based on the DOMDataUtils::getJSONAttribute() method which returns a structured value: in the PHP implementation an associative array; formerly in the JavaScript implementation a JS Object. This value is nominally present in "plain old HTML" or via Element::getAttribute() as the JSON-encoded value of the array/object, but is stored "live": that is, it is not parsed and re-serialized to a string attribute value every time the attribute is read or modified, but instead is kept as a live array or object value attached to the DOM Node. References can be kept to the live value, and it can be mutated, with the change immediately visible to anyone else who holds a reference to the value.

The actual implementation is a bit baroque -- in addition to a proposal ("Rich attributes") to extend this basic mechanism to include DOM DocumentFragment values, there are multiple different serialization formats for these structured values. The nominal "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. This is mitigated by the use of single-quotes around the attribute value in a minor departure from standard HTML serialization, but if the structured value contains HTML markup, escaping becomes inevitable as (a) both available quotation marks have been used, and (b) < and & are additionally required to be escaped. There's a separate but orthogonal issue with the exposure of "private" attributes in this naive serialization, discussed below. For these two reasons, Parsoid has historically supported two additional alternative encodings of structured attributes. By adding a unique id attribute to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is called a PageBundle (page bundles have further uses described below). In another representation the combination is kept as a single HTML document, but the JSON-encoded map is stored in a <script> element in the <head> of the Document. This reduces the bloat caused by encoding all the quotation marks in the structured attributes, but adds additional bandwidth to record ID attributes on every node and additionally to include those ID values in the key portion of the map in the <head>.
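
To make the trade-off between these encodings concrete, here is a minimal Python sketch. It is not Parsoid code: the function names <code>serialize_inline</code> and <code>hoist_to_map</code>, the dictionary-based node representation, and the node IDs are assumptions made for illustration only.

```python
import json


def serialize_inline(name, value):
    """Serialize a structured value as a single-quoted inline attribute.

    With single quotes around the value, JSON's double quotes need no
    escaping; this sketch escapes only &, < and the single quote itself.
    """
    encoded = json.dumps(value, separators=(",", ":"))
    escaped = (
        encoded.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace("'", "&#39;")
    )
    return f"{name}='{escaped}'"


def hoist_to_map(nodes, name):
    """Move one structured attribute out of every node into an ID-keyed map."""
    id_map = {}
    for node in nodes:
        if name in node["attrs"]:
            id_map[node["id"]] = node["attrs"].pop(name)
    return id_map


# A value containing HTML markup and a single quote forces escaping
# even in the single-quoted form.
inline = serialize_inline("data-mw", {"body": {"html": "<i>it's</i>"}})

nodes = [
    {"id": "mwAA", "attrs": {"data-mw": {"name": "ref"}}},
    {"id": "mwAQ", "attrs": {"class": "plain"}},
]
id_map = hoist_to_map(nodes, "data-mw")
```

After hoisting, the nodes carry only their IDs and ordinary attributes, and the structured values live in <code>id_map</code>, which could be shipped either as a separate JSON blob or embedded in the document head.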

This id-to-attribute map is also used internally to the implementation: instead of hanging the rich attribute values directly off of the DOM Node, in the PHP implementation the ID-to-value map is stored in a DataBag which is attached to the root Document object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the Node implemented by the libxml2 library. Those ephemeral PHP wrapper objects are created and destroyed every time a reference to the Node goes into or out of scope in PHP. When the ephemeral PHP wrapper goes out of scope, any data attached to the Node is destroyed, even if a reference to the Node is still present in the native document model. By keeping a persistent reference to the (wrapper of the) main Document object in Parsoid's Env class, which is kept alive for the duration of the parse, we can prevent the DataBag from being destroyed. (We could also just keep an explicit reference to the DataBag in the Env which would avoid the use of dynamic properties in PHP.)

Parsoid contains a "load" mechanism that runs after DOM parsing and loads structured-value attributes into the DataBag, implemented in DOMDataUtils::visitAndLoadDataAttribs(), and a corresponding "store" mechanism in DOMDataUtils::visitAndStoreDataAttribs(). In our implementation, changes made to the live object stored in the DataBag are not reflected in the raw attribute value visible via Element::getAttribute() until a "store" is done. Similarly, several Parsoid helper methods based on structured attributes, like ::getDataParsoid(), will not work correctly until a "load" is done. The "eager" loading mechanism could be replaced by a "lazy" loader which does not locate and load structured values (whether from inline attributes or a map) until they are requested. Lazy loading could eliminate the need for an explicit "load" step, but since the values are live and can be mutated without notification to the DOM layer, an explicit "store" step will always be necessary to ensure the serialized DOM reflects the latest values for structured-value attributes.
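The load/store cycle described above can be sketched as follows. This is a minimal Python illustration using the standard library's ElementTree, not the Parsoid implementation: the `DataBag` class, the `load` and `store` functions, and the `mw<n>` id scheme are hypothetical stand-ins, and the sketch leaves the raw attribute in place after loading.

```python
import json
import xml.etree.ElementTree as ET

STRUCTURED_ATTRS = ("data-parsoid", "data-mw")  # attribute names from the text

class DataBag:
    """Hypothetical stand-in: node id -> {attribute name -> live value}."""
    def __init__(self):
        self.by_id = {}

def load(root, bag):
    """Eager 'load': decode inline JSON attributes into live values in the bag."""
    for n, el in enumerate(root.iter()):
        live = {}
        for name in STRUCTURED_ATTRS:
            raw = el.get(name)
            if raw is not None:
                live[name] = json.loads(raw)
        if live:
            el.set("id", el.get("id") or f"mw{n}")  # assumed id scheme
            bag.by_id[el.get("id")] = live

def store(root, bag):
    """'Store': write the current live values back as JSON attribute strings."""
    for el in root.iter():
        for name, value in bag.by_id.get(el.get("id"), {}).items():
            el.set(name, json.dumps(value, separators=(",", ":")))

doc = ET.fromstring("""<body><p data-mw='{"name":"ref"}'>text</p></body>""")
bag = DataBag()
load(doc, bag)

p = doc.find("p")
live = bag.by_id[p.get("id")]["data-mw"]
live["name"] = "references"      # mutate the live value; the DOM is not notified

stale = p.get("data-mw")         # raw attribute still holds the old JSON
store(doc, bag)
fresh = p.get("data-mw")         # raw attribute now reflects the mutation
```

Because the mutation happens on a plain dictionary, nothing tells the DOM that the attribute is out of date, which is why the explicit store step cannot be removed even with lazy loading.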

Private attributes

The implementation and encoding of structured-value attributes in Parsoid was also influenced by an API decision that the contents of data-parsoid attributes were to be considered implementation-private. This was enforced at an API level by stripping the data-parsoid attributes in HTML provided to most clients, and then re-inserting the attributes from separate storage when necessary, keyed by a render ID assigned to the parse. In addition to strictly enforcing the abstraction boundary, this also saved bandwidth on API responses.

Special treatment was also extended to data-mw attributes, which were used by convention to store information "needed by editing clients but not for readers". The idea was that data-mw attributes would also be stripped in content served for read views or for reader clients to save additional bandwidth.

In this context, an additional benefit to storing the structured attributes outside the Document (or in a separate element in the <head>) was that it allowed API code to efficiently implement these attribute-stripping strategies without requiring node-by-node traversal. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, used the contents of data-mw attributes, requiring the data-mw attributes to be explicitly reloaded from the separate storage before the HTML was usable.
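A rough sketch of why hoisted storage makes stripping cheap: when each structured attribute lives in its own ID-keyed map, removing it is a dictionary operation and the HTML string is never touched. The bundle layout, field names, and sample values below are assumptions for illustration and do not reproduce Parsoid's PageBundle schema.

```python
import copy

# Hypothetical bundle: HTML carrying only ids, plus one id-keyed map per attribute.
bundle = {
    "html": '<p id="mwAQ">Hello <a id="mwAg" href="./World">world</a></p>',
    "attrs": {
        "data-parsoid": {"mwAQ": {"dsr": [0, 17, 0, 0]}, "mwAg": {"dsr": [6, 17, 2, 2]}},
        "data-mw": {"mwAg": {"note": "editing metadata"}},
    },
}

def strip_attributes(bundle, names):
    """Drop whole attribute maps; no node-by-node traversal of the HTML is needed."""
    kept = {name: ids for name, ids in bundle["attrs"].items() if name not in names}
    return {"html": bundle["html"], "attrs": copy.deepcopy(kept)}

def restore_attributes(stripped, storage, names):
    """Re-insert attribute maps from separate storage (here, the original bundle)."""
    merged = dict(stripped["attrs"])
    for name in names:
        merged[name] = copy.deepcopy(storage["attrs"][name])
    return {"html": stripped["html"], "attrs": merged}

for_editor = strip_attributes(bundle, {"data-parsoid"})             # private data removed
for_reader = strip_attributes(bundle, {"data-parsoid", "data-mw"})  # read view
round_trip = restore_attributes(for_editor, bundle, {"data-parsoid"})
```

The restore step is the cost mentioned above: a client that needs data-mw inline, as VisualEditor does, forces the map to be merged back into the HTML before use.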

Since the implementation of separate storage was tied to abstraction boundary design goals for data-parsoid and data-mw specifically, the DataBag and load/store mechanism was initially implemented only for structured values of these two attributes. Other structured value attributes used the uncached storage mechanism of DOMDataUtils::getJSONAttribute() and the values returned were not live but had to be explicitly saved with DOMDataUtils::setJSONAttribute(). The Rich Attribute proposal would extend live storage to all structured-value attributes and separate out the policy decision regarding the precise set of structured attributes to be encoded in separate storage (as opposed to inline).

Design decisions

For an HtmlHolder interface in core, two views of the document are provided: an HTML string and a DOM object model.

We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to DOMDataUtils::getJSONAttribute() will be provided in core (or more likely, in an HTML library which may also contain parts of Parsoid's DOMCompat library) which will work on the DOM as returned by HtmlHolder. This is consistent with either an eager "load" step occurring after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.

The HTML string provided by HtmlHolder will be the "naive" inline-attribute serialization of the document, not one of the alternate encodings. When converting from DOM to a string, an appropriate "store" step will be performed to serialize the current live values of structured attributes. Private attributes like data-parsoid will not be stripped from the HTML string. HtmlHolder will therefore need to know about "structured value" HTML (again, as an abstraction provided by the HTML library used), but will not need to specially handle data-parsoid or any other Parsoid-internal attributes.
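The two decisions above can be sketched together. The class below is a hypothetical Python illustration, not the proposed core interface: the method names, the lazy-loading strategy, and the use of ElementTree in place of a real HTML DOM are all assumptions. It shows a DOM view on which structured values work without an explicit load, and a string view that performs the store step and leaves data-parsoid in place.

```python
import json
import xml.etree.ElementTree as ET

class HtmlHolder:
    """Hypothetical sketch: one document, two views (HTML string and DOM)."""

    def __init__(self, html):
        self._html = html   # string view, authoritative until a DOM is requested
        self._dom = None    # DOM view, created on demand
        self._live = {}     # element -> {attribute name -> live value}

    def get_dom(self):
        if self._dom is None:
            self._dom = ET.fromstring(self._html)
            self._html = None  # from now on the DOM is authoritative
        return self._dom

    def get_json_attribute(self, element, name):
        """Lazy 'load': decode on first access, then keep returning the live value."""
        attrs = self._live.setdefault(element, {})
        if name not in attrs:
            raw = element.get(name)
            attrs[name] = None if raw is None else json.loads(raw)
        return attrs[name]

    def get_html(self):
        if self._dom is None:
            return self._html
        # 'Store' step: flush live values into inline attributes before serializing.
        for element, attrs in self._live.items():
            for name, value in attrs.items():
                if value is not None:
                    element.set(name, json.dumps(value, separators=(",", ":")))
        return ET.tostring(self._dom, encoding="unicode", method="html")

holder = HtmlHolder(
    """<body><p data-parsoid='{"dsr":[0,4]}' data-mw='{"name":"ref"}'>text</p></body>"""
)
p = holder.get_dom().find("p")
holder.get_json_attribute(p, "data-mw")["name"] = "references"  # no explicit load
out = holder.get_html()  # inline-attribute serialization; data-parsoid is kept
```

Note that the holder only needs to know which values were requested as structured values; it has no special case for data-parsoid or any other attribute name.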

Serialization to ParserCache

Note that the actual representation stored in the ParserCache (ie, the serialized version of the HtmlHolder) does not need to be the same as the string form of the HTML returned by HtmlHolder. Optimized encodings could be utilized to reduce the "lots of escaped double-quotes and angle brackets" issue with the naive inline-attribute representation. The primary performance requirement is that, if policy decides that read view HTML is to be cached for a specific page, that read view HTML be able to be rendered directly from the serialized value stored in the ParserCache with minimal additional processing. But read view HTML is not expected to have many (if any) structured-value attributes in it. So long as optimized encodings do not touch the set of attributes present in read views, then read views ought to still be able to be served directly from the ParserCache representation. (Perhaps the optimized serialization can include a flag explicitly indicating when the optimized serialization is suitable for directly serving to clients, based on the absence of structured value attributes found.)

For edit views, the "inline-attribute" representation matches what the VisualEditor client expects, although currently data-parsoid is stripped by the API. The visual editor API which provides access to edit-mode HTML can choose to reimplement data-parsoid stripping for performance/bandwidth reasons, but it is not required. We are already serving content with inline data-parsoid to some VE clients, so the presence or absence of data-parsoid should not cause issues.

The precise details of the ParserCache serialization of HtmlHolder should as far as possible be hidden from clients, and changes made to the serialization format for performance or efficiency reasons should not affect the DOM model or HTML strings provided to callers.

It is worth noting that the JSON serializer used by ParserCache is currently implemented in MediaWiki core. Although probably not strictly required, a JSON codec implemented in an external library would be helpful in ensuring that HtmlHolder is deserialized as an object of the correct type: T346829.

Enumeration of fragments and metadata

In addition to an HTML Document, wikitext parsing results in a collection of metadata. Historically that metadata was stored in the PageBundle and returned to API clients as JSON, although some portions of the metadata were also returned as HTTP headers in the REST response. The integration of Parsoid with core has eliminated the need for a REST API-focused PageBundle structure, and made available the much richer ParserOutput object to hold metadata generated by parsing. For compatibility with existing calling conventions and the REST API, methods in core exist to convert metadata stored in PageBundle objects to "extension data" stored in ParserOutput, and the ContentMetadataCollector interface in Parsoid exists to allow Parsoid to directly write metadata to the ParserOutput object held by core. We currently accommodate the encoding of structured attributes as a standalone map by embedding that map in the PageBundle, which is then reflected into the parsoid-page-bundle extension data key when the PageBundle is stored in a ParserOutput.

The richer variety of metadata represented by ParserOutput and newly implemented by Parsoid introduced another issue: instead of one Document representing the entire result of the parse, certain pieces of metadata were "HTML strings" and thus logically separate DocumentFragments generated by the parse. Many of these fragments were stripped HTML of one sort or another (page title, TOC entries) but, for example, the "page indicator" mechanism in core represented an entire wikitext fragment that certainly requires post-processing (localization) and likely requires appropriate representation of structured attributes within the fragment as well. Extension implementations seem to want to store Parsoid-generated document fragments in ParserOutput's extension data mechanism as well, for later use in a final composition step.

This raises two related questions:

  • Should short HTML fragments of this sort be represented by individual HtmlHolder objects? If the HtmlHolder objects are separate, is the "owner document" for each fragment unique as well, or are all fragments conceptually part of a single Document?
  • For post-processing passes which want to operate on all Parsoid-generated HTML (for example, user-specific localization), how can all such fragments be located within the ParserOutput (and its extension data) and enumerated so they can be appropriately transformed?

It's worth noting that similar questions arose in the Parsoid implementation regarding the "owner document" of fragments created internally during parse and that after much work most fragments in Parsoid now share the same owner document (although an awkward Remex API means many of these fragments are created as separate documents that then have to be adopted by the main owner). Unifying the owner documents is not a complete solution to the enumeration question, however, since there exists no DOM API for enumerating all child fragments of a given owner document (and to do so would seem to require weak references at least).

Design decisions

The PageBundle data structure will be removed from Parsoid and moved from core's MediaWiki\Parser namespace into the REST API implementation, as a feature of the REST interface design but not a core Parsoid abstraction. The metadata written by Parsoid which is not already reflected by appropriate ParserOutput properties, such as specific content headers needed by the REST API, will be written by Parsoid directly to ParserOutput extension data using Parsoid's ContentMetadataCollector interface, either using the existing parsoid-page-bundle key or new keys specific to the particular metadata. The main Parsoid entrypoints will use HtmlHolder+ContentMetadataCollector as result types rather than PageBundle; this will also avoid a serialization step and allow Parsoid to return its DOM result (with live structured attributes) directly to core. The XmlSerializer code can also be removed from Parsoid, since Parsoid's APIs will now be DOM-based. HtmlHolder plus support code for structured attributes will likely be moved to a library, which is probably also a good home for serialization code like XmlSerializer.

(Tentatively:) An API will be provided to store and fetch DocumentFragments by ID from <template> elements in the Document <head>. "Child" HtmlHolder instances will serialize themselves as simply the appropriate ID key, and they will fetch the appropriate DocumentFragment from the parent based on ID when necessary. This will allow storage of DocumentFragments (held by the HtmlHolder) in extension data or in ParserOutput fields. Structured data contained within these fragments will then be loaded and stored by the parent Document (held by its own HtmlHolder), so live manipulation of it works as it does in the main document. Since these child fragments are part of the main document tree, they can be enumerated and mutated in-place by post-processing passes without explicit knowledge of each fragment, and structured data attributes within the fragment will be transparently held by the DataBag or other mechanism used by the parent. Inside the HTML library, an API will allow easy creation of a new empty DocumentFragment/HtmlHolder tied to the owner document, as well as (for legacy compatibility) creating a new DocumentFragment/HtmlHolder tied to the owner document from an HTML string. Enumerating all fragments for post-processing can be done with Document::querySelectorAll('body, head > template'); this can also be exposed as an API helper method.
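The tentative fragment API could look roughly like the sketch below. It is illustrative Python over ElementTree, which is only an approximation: a real HTML DOM keeps template contents in a separate content fragment, and the function names and `mwf<n>` ids are hypothetical. The enumeration helper mirrors the `body, head > template` selector in document order.

```python
import xml.etree.ElementTree as ET

def store_fragment(doc, fragment_id, markup):
    """Park a fragment in <head> inside <template id=...>.

    The markup must be well-formed with a single root element in this sketch.
    """
    template = ET.SubElement(doc.find("head"), "template", {"id": fragment_id})
    template.append(ET.fromstring(markup))
    return template

def fetch_fragment(doc, fragment_id):
    """Return the <template> holding the fragment, or None if the id is unknown."""
    return doc.find(f"./head/template[@id='{fragment_id}']")

def all_fragment_roots(doc):
    """Rough equivalent of querySelectorAll('body, head > template')."""
    return doc.findall("./head/template") + doc.findall("./body")

doc = ET.fromstring("<html><head><title>T</title></head><body><p>Main</p></body></html>")
store_fragment(doc, "mwf1", "<span>Indicator</span>")
store_fragment(doc, "mwf2", "<span>Title</span>")

# A post-processing pass reaches every fragment without knowing who created it.
for root in all_fragment_roots(doc):
    for el in root.iter("span"):
        el.set("lang", "en")
```

Because the fragments hang off the same document, a single traversal covers the main content and every stored fragment, and any per-document structured-attribute storage applies to them as well.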

The JSON codec for child holders will need to use the codec context to ensure that child HtmlHolder objects are properly relinked to the parent on deserialization. The Slack discussion on T346829 seemed to get hung up on whether this sort of stateful deserializer should be "discouraged but possible" or whether the JSON codec should explicitly prohibit anything but value objects. If serialization is restricted to simple value objects, then the HtmlHolder::getHtml() and HtmlHolder::getDom() methods need to include a parent object (ParserOutput, parent HtmlHolder, etc) as an explicit parameter (or as an explicit parameter of a similar-but-not-identical ChildHtmlHolder class) so that the child holder can be relinked to the parent Document after deserialization. (Perhaps even ParserOutput::getRawText() could return a "child" HtmlHolder, with the contents of the <body> tag as a special case, and the full parent document stored elsewhere. This makes all HtmlHolders "children" with references to a special ParentHtmlHolder; i.e., the parent is the exception, not the child.)
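The two relinking options can be contrasted in a short sketch. The classes, the `context` parameter, and the serialized shape below are hypothetical Python stand-ins; they do not reflect the actual JsonCodec API or the final HtmlHolder design.

```python
import json

class ParentHolder:
    """Hypothetical parent: owns the document; fragments are looked up by id."""
    def __init__(self, fragments):
        self.fragments = dict(fragments)  # stand-in for <template id=...> contents

class ChildHolder:
    """Hypothetical child: serializes as nothing more than its fragment id."""
    def __init__(self, fragment_id, parent=None):
        self.fragment_id = fragment_id
        self.parent = parent

    def to_json(self):
        return {"fragment": self.fragment_id}

    @classmethod
    def from_json(cls, data, context=None):
        # Option 1 (stateful deserializer): the codec context supplies the parent.
        return cls(data["fragment"], parent=context)

    def get_html(self, parent=None):
        # Option 2 (pure value object): the caller passes the parent explicitly.
        owner = parent if parent is not None else self.parent
        if owner is None:
            raise ValueError("child holder is not linked to a parent document")
        return owner.fragments[self.fragment_id]

parent = ParentHolder({"mwf1": "<span>Indicator</span>"})
wire = json.dumps(ChildHolder("mwf1", parent).to_json())

relinked = ChildHolder.from_json(json.loads(wire), context=parent)  # option 1
unlinked = ChildHolder.from_json(json.loads(wire))                  # option 2
```

With option 1 the caller can use `relinked.get_html()` directly; with option 2 every access has to name the parent, as in `unlinked.get_html(parent)`, which is the API cost of restricting the codec to value objects.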

Note that the initial steps in the Rich Attributes proposal (before proposal 3) require the caller to provide the context type for the deserializer. This differs from the JsonCodec used in core, which follows the more typical design in which the serialized object contains its own type marker.
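The difference between the two codec designs can be shown in a few lines. This is a generic Python sketch: the `Point` class, the registry, and the `_type_` marker key are assumptions for illustration rather than a description of either codec's actual wire format.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Point:
    x: int
    y: int

REGISTRY = {"Point": Point}  # hypothetical class registry

# Design A: the serialized object carries its own type marker.
def encode_self_describing(obj):
    return json.dumps({"_type_": type(obj).__name__, **asdict(obj)})

def decode_self_describing(text):
    data = json.loads(text)
    cls = REGISTRY[data.pop("_type_")]
    return cls(**data)

# Design B: no marker in the output; the caller supplies the expected type.
def encode_with_hint(obj):
    return json.dumps(asdict(obj))

def decode_with_hint(text, cls):
    return cls(**json.loads(text))

a = decode_self_describing(encode_self_describing(Point(1, 2)))
b = decode_with_hint(encode_with_hint(Point(1, 2)), Point)
```

Design B produces more compact output, which matters for values embedded in HTML attributes, but it only works where the reader already knows what type to expect at each position.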

I believe that the parsing model for <template> inside <head> allows serialization of fragments in an appropriate way. <meta> and <link> tags are processed using the "in head" insertion mode, but this seems to match how they are processed "in body". If there is some issue it may be necessary to add another wrapper element inside the <template> (like a <body> tag) to reset the parsing mode so we're not "in template" and <meta> etc tags are parsed properly.