Issue 402: representing compound name strings
Posted by Martin on 15/11/2018
Dear All,
I would expect that the library or archival community has a good practice for how to "squeeze" a compound name, such as
"His Majesty Dr. Snoopy Hickup Miller Jr", with respective separators, into a machine-readable string that could be used as a custom datatype in an rdfs:Literal as one instance of Appellation, rather than defining all possible name constituents as individual RDF properties.
Could be a MARC string? XML? TEI?
This would be very helpful for our users.
Posted by Richard Light on 21/11/2018
Martin,
I'm pretty sure that the most recent attempt at doing this will be the subfield markers ($a, etc.) in MARC, which date from the era of punched cards. The requirement that all of the name appears in a single string will rule out anything that might have been done in XML (where you might typically use attributes or subelements) or TEI (which is, after all, simply an XML application).
It's a nice idea, which follows the approach of encoding one 'compound' value as a single string, but I don't think we will find a ready-made standard for it.
Posted by Martin on 22/11/2018
Dear Richard,
XML is even better. The distinction between XML tags and MARC subfield markers is not so substantial. An XML file is still a string. The question is about RDF, putting a compound into rdfs:Literal.
So, again, is there a good practice with XML elements?
Posted by Robert Sanderson on 22/11/2018
My concern with this approach is that standard mechanisms for interacting with the data will not expect these sorts of compound values. This would also affect other ongoing discussions, such as compound monetary amounts or other dimensions.
For example, if there are subfield indicators or XML elements embedded within a literal, rather than using the model to manage this information, queries at the model level will not work. If “Dr” is not a separate Appellation from “Snoopy”, with an appropriate Type associated with it to ensure it is known to be a prefix rather than a first name, it will be invisible to SPARQL or any other graph query language.
For names, which already support partitioning, the answer seems obvious to me that we should continue to use the model as intended. The consistency for compound dimensions needs further discussion. Similarly the value range for dimensions should follow existing patterns (P81a anyone?) rather than trying to embed one format within another.
Posted by Øyvind on 22/11/2018
Dear Martin,
this is how the TEI would do it: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ND.html#NDPER
So something like:
<persName><roleName type="royal">His Majesty</roleName> <roleName type="academic">Dr.</roleName> <forename>Snoopy</forename> <addName type="nickname">Hickup</addName> <surname>Miller</surname> <genName>Jr</genName></persName>
But one would need guidelines, esp. for the type attribute. In TEI everything can be done, and in several ways...
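As a rough illustration of what a consumer of such a literal would have to do, the sketch below parses the persName example with Python's standard XML parser and recovers both the typed parts and the flat display string. The function names are invented for this example, and the element and type values are simply those used in the example above, not an agreed guideline.

```python
import xml.etree.ElementTree as ET

# The persName example from this thread, written with straight quotes
# so that it is well-formed XML.
PERS_NAME = (
    '<persName>'
    '<roleName type="royal">His Majesty</roleName> '
    '<roleName type="academic">Dr.</roleName> '
    '<forename>Snoopy</forename> '
    '<addName type="nickname">Hickup</addName> '
    '<surname>Miller</surname> '
    '<genName>Jr</genName>'
    '</persName>'
)

def name_parts(xml_literal):
    """Return (element name, type attribute or None, text) for each child element."""
    root = ET.fromstring(xml_literal)
    return [(child.tag, child.get("type"), child.text) for child in root]

def display_form(xml_literal):
    """Flatten the literal into a plain display string with single spaces."""
    root = ET.fromstring(xml_literal)
    return " ".join("".join(root.itertext()).split())

parts = name_parts(PERS_NAME)
```

Here `parts` holds one tuple per name component, while `display_form(PERS_NAME)` gives back the plain string a user would see.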
Posted by Richard Light on 22/11/2018
Martin's original suggestion involved identifying contexts where we could express compound values as a single string. This approach potentially has merit where such a string, as a whole, is in a format which is meaningful within existing systems and processible by existing software. As you say, there is a direct trade-off between the convenience and structural simplicity of having a single string (and an associated single 'unit') and the [lack of] potential for native RDF querying of the contents of that string.
I think it is more of a loss to be unable to query on people with forename "Richard" than to be unable to query all dimensions involving '6 inches'. So I agree that we should not pursue this particular line of thought.
As regards using an XML encoding within a literal, I think this would be a really bad idea. It would require the provision of an XML parser and support tools within the context of all RDF serializations (Turtle, JSON, ...). RDF/XML has provision for embedded XML, but this wouldn't help for any other serialization of the RDF.
Posted by Christian Emil on 22/11/2018
As Richard and Øyvind show, it is always possible to encode information into a string, both in MARC and in XML. It is of course also possible to use JSON for those preferring that formalism. MARC is old, that is correct, and it is not very readable. I would dare to say that MARC and JSON share the unreadability property. MARC and TEI represent predefined but flexible encoding schemas. XML and JSON are general encoding formalisms.
The main point is that data encoded in XML, TEI-XML, MARC or JSON are strings which can be represented as literals in RDF. However, if one wants to decode the string into structured information, one needs to know (and agree on) the encoding schema. As Øyvind points out, there are many ways to encode the same information in TEI as well as in MARC.
Posted by Martin on 22/11/2018
Dear Richard, Robert,
It is simply wrong that encoding structured data into an rdfs:Literal makes it invisible to SPARQL. It is exactly what xsd:dateTime does. The year, month, etc., is available to querying individually in SPARQL, not by magic but by a standard extension mechanism. It is a question to IT experts to tell us how to upload into the SPARQL code the respective string functions for other compounds. If we decide one standard way to encode the person name compounds, that would be quite feasible. Interoperability is in any case given with a trivial mapping, because standard SPARQL recognizes any custom datatype. Of course we would also provide standard string functions to take the compound apart. For this discussion we need a completely informed decision.
We must really be more aware of how badly current RDF platforms still perform with longer property paths. There are good reasons why time, geometry and others are not encoded with RDF properties.
The first question we have to answer is A) how many compounds we need that must be queried component-wise. Then we should find B) the best XML representation, regardless of platform. Then we discuss C) how that should go into RDFS.
I propose for A):
1) miles-yards.... American Standard Lengths "A mile is exactly 1.609344 kilometers. Yes, the mile has a metric definition." (https://www.mathsisfun.com/measure/us-standard-length.html)
2) Person Name compounds,
3) Street address compounds
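To make the first kind of compound in the list above concrete, here is a minimal sketch that takes a mixed-unit length string apart and reduces it to metres. The textual format ("2 miles 300 yards"), the function name and the unit table are assumptions made for this example only; the conversion factors are the exact metric definitions (1 mile = 1.609344 km, 1 yard = 0.9144 m).

```python
import re
from fractions import Fraction

# Exact metric definitions of the units, in metres.
METRES_PER_UNIT = {
    "mile": Fraction(1609344, 1000),
    "yard": Fraction(9144, 10000),
    "foot": Fraction(3048, 10000),
    "feet": Fraction(3048, 10000),
    "inch": Fraction(254, 10000),
}

def compound_length_to_metres(text):
    """Sum every '<integer> <unit>' pair found in the string, as an exact fraction of metres."""
    total = Fraction(0)
    # The unit alternatives also match plural spellings such as "miles" or "inches",
    # because only the leading part of the word has to match.
    for amount, unit in re.findall(r"(\d+)\s*(mile|yard|foot|feet|inch)", text):
        total += int(amount) * METRES_PER_UNIT[unit]
    return total

length = compound_length_to_metres("2 miles 300 yards")
```

Using exact fractions rather than floats keeps the result comparable across differently written compounds; `float(length)` gives the value in metres when a decimal is wanted.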
I propose for B)
2) following either TEI or RDA guidelines. I do not propose to use MARC tags as is. The translation into XML elements is trivial syntactic sugar (and exists, I think). The relevant question is whether the analysis is effective or not.
I propose for C)
to find out if anybody has solved the problem already.
So, does anybody propose a good-practice analysis of name compounds?
Posted by Conal Tuohy on 25/11/2018
I think it's potentially helpful to encode compound data such as personal names using XML literals in an RDF graph, for display purposes, but not for SPARQL querying. For efficient querying, I don't see any good alternative to providing separate literals for the individual components of the name, such as with "foreName", "surname", etc properties in separate RDF triples. I suggest that RDF encoding guidelines could suggest adopting both practices (i.e. redundant representation both as parts and also as a whole).
On Fri, 23 Nov 2018 at 03:53, Martin Doerr <martin@ics.forth.gr> wrote:
> Dear Richard, Robert,
> It is simply wrong that encoding structured data into an rdfs:Literal makes it invisible to SPARQL. It is exactly what xsd:dateTime does. The year, month, etc., is available to querying individually in SPARQL, not by magic but by a standard extension mechanism.
The date functions in SPARQL that allow an xsd:dateTime literal to be broken down into year, month, day, etc. are not really an extension to SPARQL; they are part of the SPARQL language standard: <https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#func-date-time>. Because xsd:dateTime is a standard datatype in SPARQL, a SPARQL processor can achieve efficiencies by normalizing the values (to a standard time zone) and using the normalized form in comparisons.
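The normalization point can be illustrated outside SPARQL. The sketch below uses Python's datetime module as a stand-in for what a SPARQL processor might do internally (an assumption for illustration, not a description of any particular engine): two different lexical forms of the same instant are converted to UTC, after which they compare equal and their components can be read off individually.

```python
from datetime import datetime, timezone

def normalise(lexical):
    """Parse a dateTime lexical form with an explicit offset and convert it to UTC."""
    return datetime.fromisoformat(lexical).astimezone(timezone.utc)

# The same instant written with two different time zone offsets.
a = normalise("2018-11-22T18:53:00+02:00")
b = normalise("2018-11-22T16:53:00+00:00")

same_instant = (a == b)
components = (a.year, a.month, a.day, a.hour)
```

The `components` tuple plays the role of the standard SPARQL functions YEAR(), MONTH(), DAY() and HOURS() applied to the literal.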
The SPARQL specification does allow for SPARQL implementations to have "extension" functions, though, and to extend the operation of built-in SPARQL operators such as "<" or "=", so hypothetically a SPARQL store might offer XPath-evaluation functions to query inside XML literals, analogously to the way that the REGEX and REPLACE functions do with string literals. This kind of hybrid RDF graph/XML tree model could be supported effectively by a SPARQL store which maintained indices of the tree structure of the XML literal objects it contained. I believe Virtuoso actually has such a feature, and there may well be other SPARQL engines with a similar feature, but I personally think it would be unhelpful for the CRM to suggest an approach that depends on such a non-standardised extension.
> It is a question to IT experts to tell us how to upload into the SPARQL code the respective string functions for other compounds.
The standard SPARQL string functions (including regular expression) can be used to parse "compound" string literals, though not to parse XML literals, in general, since XML is not a regular language. Of course the CIDOC CRM could suggest "regular" XML encodings for particular types of compound literals; for example a "persName" data type could be defined and constrained with a regular expression to require that it begins with "<persName xmlns='http://www.tei-c.org/ns/1.0'>" and ends with "</persName>", optionally containing child elements beginning with "<forename>" and ending with "</forename>", and even for these elements to have attributes (such as 'type') drawn from a particular value space. They could be queried using SPARQL string functions e.g. like so:
SELECT ?person
WHERE {
?person tei:persName ?persName.
FILTER(CONTAINS(?persName, '<forename>Richard</forename>'))
}
However, relying on SPARQL FILTER and string-parsing would be grossly inefficient in terms of query performance, compared to querying individual properties, e.g.
SELECT ?person
WHERE {
?person tei:foreName 'Richard'.
}
If the "compound" XML literals are not intended for fine-grained querying, they can still be valuable for display purposes, but I don't see much value in constraining them beyond the general "XML literal" datatype. An information system that understands XML literals can examine the XML and process it appropriately based on its namespace.
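The idea of a "regular" persName datatype described above can be sketched in Python. The pattern below is one possible constraint, assumed for illustration only: a fixed wrapper element in the TEI namespace, containing a flat sequence of known child elements, each with an optional type attribute and no nested markup. The list of child elements and the helper names are invented for this example.

```python
import re

TEI_NS = "http://www.tei-c.org/ns/1.0"

# One child element: a known tag, an optional type attribute, text without markup,
# and a closing tag that matches the opening one (backreference \1).
CHILD = r"<(forename|surname|roleName|addName|genName)(?: type='[a-z]+')?>[^<]*</\1>"

PERS_NAME_RE = re.compile(
    r"<persName xmlns='" + re.escape(TEI_NS) + r"'>(?:\s*" + CHILD + r")*\s*</persName>"
)

def is_valid_pers_name(literal):
    """True if the whole literal conforms to the hypothetical regular persName datatype."""
    return PERS_NAME_RE.fullmatch(literal) is not None

def has_forename(literal, value):
    """Plain substring test, the same idea as FILTER(CONTAINS(...)) in the query above."""
    return "<forename>" + value + "</forename>" in literal

good = ("<persName xmlns='" + TEI_NS + "'>"
        "<forename>Snoopy</forename> <surname>Miller</surname></persName>")
nested = ("<persName xmlns='" + TEI_NS + "'>"
          "<forename><hi>Snoopy</hi></forename></persName>")
```

The substring test is only reliable because the datatype forbids nested markup and attribute variation inside `<forename>`; it also has to scan every literal, which is the performance cost noted above.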
Posted by Robert Sanderson on 28/11/2018
Action from the SIG meeting to send information about partitioning of names:
Personal Names:
MARC has three subfields for name, in the bibliographic USMARC:
https://www.loc.gov/marc/bibliographic/bd100.html
Which has a lot of name fields, but also a lot of related things to a name (such as date of a work in subfield f)
And the equivalent in MODS, for the type of namePart:
https://www.loc.gov/standards/mods/userguide/name.html#namepart
given, family, date, and termsOfAddress
In the Getty AAT vocabulary, we have the following types of names
http://www.getty.edu/vow/AATHierarchy?find=&logic=AND&note=&subjectid=300266386
Which include both types of the complete name (e.g. noms de guerre) and parts of names (middle name).
And name related concepts generally
http://www.getty.edu/vow/AATHierarchy?find=&logic=AND&note=&page=1&subjectid=300404653
Which includes prefix/suffix/title and similar.
Place Names:
For places, we have looked at the FGDC endorsed standard:
https://www.fgdc.gov/fgdc-news/fgdc-endorses-address-data-standard
https://www.fgdc.gov/standards/projects/address-data
Which is … comprehensive, to say the least. We then cherry-picked the bits that we thought most useful, given the level of data description that we need for cultural heritage purposes.
Posted by Thanasis on 28/11/2018
And some information from me about addresses:
OASIS xAL (extensible Address Language) as described here:
https://www.oasis-open.org/committees/ciq/download.shtml
The W3C Schema can be downloaded from here:
https://www.oasis-open.org/committees/ciq/Downloads/ciq_schemas.zip
In the 42nd joint meeting of the CIDOC CRM SIG and ISO/TC46/SC4/WG9 and the 35th FRBR - CIDOC CRM Harmonization meeting, it was agreed that the crm-sig should look at the best practices adopted by international communities, and then re-address the issue of finding a meaningful way to represent massive compound names (such as properties, person street compounds etc) in the next crm-sig meeting.
HW: NC, TV, MR (shall provide input about the meaning of the compounds), SS and RS are assigned with providing input.
Berlin, November 2018
Posted by Robert Sanderson on 30/11/2018
On 28.11.18 14:47, Robert Sanderson wrote:
>
> Action from the SIG meeting to send information about partitioning of names:
We are regularly fighting with bibliographical reference systems (Drupal Biblio and bibcite modules, CSL styles, EndNote, Zotero) and how they do (or do not) deal with Arabic[1], Spanish[2], and Chinese names.
I have not done any extensive research, but I have not seen any encompassing support for schemes other than first-middle-last name, or consistent and usable rules for how to press e.g. Arabic names into first-middle-last.
If anybody has pointers to good solutions I would be grateful.
Cheers
Robert C.
[1]: https://en.wikipedia.org/wiki/Arabic_names
[2]: https://en.wikipedia.org/wiki/Spanish_naming_customs
Posted by Daria Hookk on 30/11/2018
Dear colleagues,
You completely forgot Russian names, where middle names do not exist; instead, the middle part refers to the father's name.
Thus mine is Daria Yurievna (father Yuri) Hookk.
Posted by Gordon Dunshire on 30/11/2018
All
IFLA has been gathering information on forms of names of persons:
https://www.ifla.org/node/4953
RDA: Resource Description and Access is developing instructions for creating access points for names. The wide variety of components of names, and cultural norms for reformatting components for indexing and browsing, means that RDA cannot, and will not, provide explicit instructions for every form of name. Instead, specific methods will be at the discretion of cataloguing and metadata agencies. The RDA instructions will ask for the “string encoding scheme” used to construct an access point to be specified as data provenance. This is the equivalent of Dublin Core’s syntax encoding scheme (SES), which specifies how a string is constructed from component strings, covering standard and ad hoc XML datatypes.
Separating the SES as an external document and linking it as provenance to a reified triple that stores the resulting string value seems to be the only way to avoid embedding name/value information within the string itself.
This is a generic approach that also applies to the construction of citation strings, and other string values generated from component strings.
Note that a significant part of the International Standard Bibliographic Description (ISBD: https://www.ifla.org/files/assets/cataloguing/isbd/isbd-cons_20110321.pdf) is a set of string encoding schemes (punctuation patterns).
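A string encoding scheme of this kind can be pictured as a small, separately identified rule that builds the string from its components. The sketch below is purely illustrative: the "surname, forename, suffix" pattern, the scheme identifier and the record layout are assumptions, not RDA or Dublin Core definitions.

```python
# Hypothetical identifier for the scheme, recorded alongside the value as provenance.
SES_ID = "example:surname-comma-forename"

def build_access_point(parts):
    """Hypothetical SES: 'Surname, Forename, Suffix', skipping absent components."""
    ordered = [parts.get(key) for key in ("surname", "forename", "genName")]
    return ", ".join(value for value in ordered if value)

record = {
    # The constructed string carries no embedded name/value markup ...
    "value": build_access_point(
        {"forename": "Snoopy", "surname": "Miller", "genName": "Jr"}
    ),
    # ... because the rule used to build it is referenced separately.
    "string_encoding_scheme": SES_ID,
}
```

A consumer that knows the scheme can interpret or rebuild the string; one that does not can still display it as is.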
Posted by Richard Light on 30/11/2018
Gordon,
Thanks for this. The page you cite demonstrates graphically the wide variety of approaches to presenting personal names for indexing and browsing. In our SIG meeting earlier this week, we discussed this issue and agreed that what would interest us initially would be an analysis of the components of personal names. As you say, the logical way to use such an analysis would be to split the name into its component parts and record each separately.
I suggest that we park this useful discussion for future reference, and CLOSE this issue.
The CIDOC CRM Group of Editors decided to close this issue following Richard Light's advice.
14 June 2021