1. Well-formedness

To be machine-readable, XML has to be well-formed. Attributes have to be enclosed in quotes; certain characters have to be escaped; every opening tag has to be matched with a closing tag. A human reader may easily recognise that a closing tag has been omitted, and may be certain of where it should have been. But writing a processor that can guarantee to replace the missing closing tag in the right place in all circumstances – that’s a lot more difficult. And difficulty aside, can you guarantee that all such processors would come up with the same solution for every possible instance of brokenness, and can you be certain that they would not affect the financial values in the document?

Grammar matters, even to computers, and it’s there to make sure that each identifiable piece of data is recognised in the same way by any processor. There are plenty of open source XML processors which test for well-formedness – so there is no excuse for not including this in your processing chain.
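As a minimal sketch of what such a check looks like in practice – here using Python's standard library parser, with a placeholder file name – well-formedness can be tested before any other processing begins:

```python
# A minimal well-formedness check using Python's standard library XML parser.
# The file name "report.xhtml" is a placeholder.
import xml.etree.ElementTree as ET

def is_well_formed(path: str) -> bool:
    try:
        ET.parse(path)  # the parser enforces quoting, escaping and matched tags
        return True
    except ET.ParseError as err:
        print(f"Not well-formed: {err}")  # reports the line and column of the failure
        return False

if is_well_formed("report.xhtml"):
    print("Safe to hand on to the next stage of the processing chain.")
```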

Testing that a document is well-formed ensures that all the information in it can be addressed in the same way by all possible processors.

2. Schema validity

Before XML was used, data transfer tended to depend upon serial data formats like magnetic tape. In a serial data format the data is recognised by the order in which it arrives. But in XML-based applications like iXBRL, which are typically processed using a document object model (DOM), each piece of data has its own location and can be accessed in any desired order. If there are any mistakes in the structure of the document, your ability to locate that data accurately could be compromised. But as long as the XML document is well-formed, you can be certain that every XML processor will identify each piece of data in that document in the same way.
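To make the contrast concrete, here is a small illustrative sketch of location-based access in a DOM-style model; the element names are invented, and real iXBRL uses namespaced elements:

```python
# Illustrative sketch of location-based access in a DOM-style model.
# The element names are invented for this example.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<report>"
    "<header><period>2023</period></header>"
    "<figures><revenue>1000</revenue><costs>800</costs></figures>"
    "</report>"
)

# Each item is addressed by where it sits in the tree,
# not by the order in which it arrived.
revenue = doc.find("./figures/revenue").text
period = doc.find("./header/period").text
print(period, revenue)
```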

So, knowing that the document is well-formed means that it’s now possible to test whether the document contains the correct pieces of data. XML Schema defines the types of data in a document and the structure and order of the data. It is a mature standard, with good open source validators available, all with precisely the same interpretation of the XML Schema specification. This consistency is critical – it means that developers, users, and risk managers can be confident that a document which has passed schema-validation will conform in every respect to the definition in the relevant XML Schema.
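As an illustration, one of those open source validators, lxml, can apply an XML Schema to a document in a few lines; the file names here are placeholders:

```python
# Validating an instance document against an XML Schema with lxml,
# one of the open source validators; the file names are placeholders.
from lxml import etree

schema = etree.XMLSchema(etree.parse("filing-schema.xsd"))
document = etree.parse("filing.xml")

if schema.validate(document):
    print("Schema-valid: the document conforms to the schema definition.")
else:
    for error in schema.error_log:
        print(error.line, error.message)  # each deviation is reported precisely
```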

Many data transfer projects will use XML Schema to define certain kinds of business rules – the length of a particular piece of data, the allowed values for a data item, the upper or lower limits for a number, and so on. It is cheaper and more effective to define these in XML Schema, where they will be tested directly by a validator, than to provide these instructions as written business rules in Message Implementation Guidelines or other human-readable documents which will then be used as a basis for hand-coded enforcement.
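For example, all three kinds of rule can be written straight into a schema fragment along these lines – the element names and values are invented for illustration – and checked by any standard validator:

```python
# Hypothetical schema fragment expressing three such rules directly:
# a maximum length, a list of allowed values and an upper limit.
# Element names and values are invented for illustration.
from lxml import etree

XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="filing">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="companyName">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:maxLength value="160"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="currency">
          <xs:simpleType>
            <xs:restriction base="xs:string">
              <xs:enumeration value="GBP"/>
              <xs:enumeration value="EUR"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
        <xs:element name="ownershipPercent">
          <xs:simpleType>
            <xs:restriction base="xs:decimal">
              <xs:maxInclusive value="100"/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = etree.XMLSchema(etree.fromstring(XSD))
doc = etree.fromstring(
    "<filing><companyName>Example Ltd</companyName>"
    "<currency>USD</currency><ownershipPercent>110</ownershipPercent></filing>"
)
print(schema.validate(doc))  # False: the currency and the percentage both break the rules
for error in schema.error_log:
    print(error.message)
```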

However, the non-serial structure of XML data also allows a document to contain extra data – data that may be forbidden by the schema. An unvalidated XML DOM may contain an unexpected item that goes unnoticed and ignored – but remains accessible for future use. This has obvious security implications, as such a data item could consist of hostile processable code. XML Schema is important, then, not just to ensure that the correct data is in a document, but also to prevent unwanted data or code from being inserted into an otherwise usable document.
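The following sketch, with invented element names, shows how easily an undefined item survives in an unvalidated DOM:

```python
# Sketch showing how an item the schema never defined survives in an
# unvalidated DOM; the element names are invented.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<filing>"
    "<revenue>1000</revenue>"
    "<injectedPayload>do_something_hostile()</injectedPayload>"
    "</filing>"
)

# A consumer that only asks for <revenue> never notices the extra item...
print(doc.find("revenue").text)

# ...but the item is still in the DOM, addressable by anything that looks for it.
print(doc.find("injectedPayload").text)
```

Validating the same document against a schema that defines only the expected elements would reject it before the unexpected item ever reached a downstream system.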

Testing that a document is schema-valid is the most accurate way of ensuring that the right data, and only the right data, is present.

3. Specification validity

The XBRL family of specifications uses XML Schema to enforce structural constraints, but there are other requirements, marked “MUST” in the specification, that are beyond the capacity of XML Schema to describe. These “MUSTs” can only be enforced by specialist XBRL validation software.

So, for instance, in the iXBRL specification the “{id} property” is required to be unique across the Inline XBRL Document Set. (Although XML Schema gives you id uniqueness checks “for free”, they only work within a single document, whereas iXBRL needs uniqueness across a set of documents.)
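A simplified sketch of such a cross-document check might look like the following; the file names are placeholders and namespace handling is omitted, so this is illustrative rather than a conforming iXBRL validation step:

```python
# Simplified sketch of a uniqueness check across an Inline XBRL Document Set.
# File names are placeholders and namespace handling is omitted.
import xml.etree.ElementTree as ET
from collections import Counter

def count_ids(paths):
    counts = Counter()
    for path in paths:
        for element in ET.parse(path).iter():
            value = element.get("id")
            if value is not None:
                counts[value] += 1
    return counts

duplicates = [v for v, n in count_ids(["part1.xhtml", "part2.xhtml"]).items() if n > 1]
print(duplicates or "All id values are unique across the document set.")
```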

As well as advanced structural constraints, specifications also include semantic constraints such as the requirement that “endDate” must be later than “startDate”, or that XBRL processors must detect cycles in parent-child relationships.
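Both kinds of rule are simple to state but impossible to express in XML Schema; the sketch below illustrates them over simplified stand-in data structures rather than real XBRL contexts and relationship networks:

```python
# Two illustrative semantic checks that XML Schema cannot express.
# The data structures are simplified stand-ins, not real XBRL contexts
# or relationship networks.
from datetime import date

def period_is_valid(start: date, end: date) -> bool:
    return end > start  # endDate must be later than startDate

def has_cycle(parent_of: dict) -> bool:
    # Walk up the parent chain from each node; revisiting a node means a cycle.
    for start_node in parent_of:
        seen = set()
        node = start_node
        while node in parent_of:
            if node in seen:
                return True
            seen.add(node)
            node = parent_of[node]
    return False

print(period_is_valid(date(2023, 1, 1), date(2023, 12, 31)))            # True
print(has_cycle({"Assets": "BalanceSheet", "BalanceSheet": "Assets"}))  # True: a cycle
```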

In all these cases, non-compliance will lead to situations where processors cannot operate without making assumptions about the data for which there is no clear evidence. Different processors will take the same documents and will inevitably differ in the results they give.

Specification validity provides certainty that the data will be processed the same way in all compliant XBRL processing systems.

4. Taxonomy validity

For an XBRL instance document to be interpreted correctly, it must reference a valid XBRL taxonomy. An XBRL viewer will label the data it displays with the labels that are defined in the underlying taxonomy.  If the taxonomy is not correctly constructed, this and other operations will fail unpredictably. This may result in concepts appearing with the wrong labels or the wrong legal references – potentially misleading, or worse. Taxonomy validation will not prevent the construction of poor taxonomies, but it will ensure that all XBRL processors are able to interpret XBRL data the same way.

As well as providing meaning to the facts in an XBRL data document, in the form of labels and references, a valid taxonomy constrains the allowed values for facts and dimensions. For instance, it prevents the use of values like “England” or “Oxford” as members for a region dimension that expects UK counties.
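In code terms the effect is no more complicated than a membership test; the member list below is invented for illustration:

```python
# Sketch of a taxonomy-style constraint on dimension members: only members
# defined for the dimension are acceptable. The member list is invented.
ALLOWED_REGION_MEMBERS = {"Oxfordshire", "Berkshire", "Kent"}  # UK counties only

def region_member_is_valid(member: str) -> bool:
    return member in ALLOWED_REGION_MEMBERS

print(region_member_is_valid("Oxfordshire"))  # True
print(region_member_is_valid("England"))      # False: a country, not a county
print(region_member_is_valid("Oxford"))       # False: a city, not a county
```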

Taxonomy validity prevents compliant processors from interpreting XBRL data in different ways.

5. Business rule compliance

Domain-level constraints on data are usually enforced through business rules which operate on a logical model of an XBRL report that has already passed syntax checks and basic semantic checks. Some of these rules can be defined using specialist languages (such as XBRL Formula or CoreFiling’s Sphinx). They can include data evaluation tests such as “global revenue MUST equal the sum of the revenue figures for the regions”.
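Stripped of the XBRL Formula or Sphinx syntax, the logic of such a rule is a straightforward arithmetic assertion over the logical model; the figures below are invented:

```python
# Minimal sketch of the "global revenue equals the sum of the regions" rule,
# applied to a logical model that has already passed the earlier checks.
# The figures are invented and the dictionary stands in for an XBRL report.
regional_revenue = {"UK": 400_000, "EU": 350_000, "Rest of world": 250_000}
global_revenue = 1_000_000

if global_revenue != sum(regional_revenue.values()):
    raise ValueError("Business rule failed: the regional figures do not add up to the total.")
print("Business rule satisfied.")
```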

Other constraints are defined as human-readable rules in filer manuals published by authorities such as the SEC, ESMA or HMRC. Such rules are usually driven by legislative rules controlling a data collection programme or by the need to match end-user database content requirements.

Enforcement of business rules is the last line of defence before ingesting third party data into internal processing systems.

Validation as a ladder

These five levels of validation are, effectively, a ladder, each set of tests depending upon the ones below it.

As a general principle, any rule should be defined as low as possible on the ladder. Lower levels are easier and cheaper to enforce. Thus, don't define a rule in a filer manual if it can be defined in XBRL Formula; don't use formulae if the rule can be encapsulated in XML Schema. This keeps implementation as simple as possible and ensures the highest levels of compliance.

Secondly, rule-writing can (and should) assume compliance at each previous level.  Business rules, for instance, should be written on the expectation that document content rules defined in the schema have been met in full.  This keeps the number of rules to a minimum and keeps preparation and ingestion costs as low as possible.

And because each level of validation assumes that the data complies with the tests defined at the previous level, it follows that all levels of the ladder should be tested, in turn, to ensure data compliance.

Defining rules at the right level in the XBRL stack and enforcing them in the right order will minimise both preparation and consumption costs.