WEBVTT

00:00:01.100 --> 00:00:03.000
Hi, my name is Marc Schulder.

00:00:03.000 --> 00:00:05.500
My sign name is MARC

00:00:05.500 --> 00:00:09.199
and I am going to present the LREC 2022 paper

00:00:09.199 --> 00:00:11.366
<i>"How to be FAIR when you CARE:</i>

00:00:11.366 --> 00:00:13.199
<i>The DGS Corpus as a Case Study</i>

00:00:13.199 --> 00:00:16.100
<i>of Open Science Resources for Minority Languages"</i>

00:00:16.100 --> 00:00:20.066
which I wrote together with my co-author Thomas Hanke.

00:00:22.500 --> 00:00:25.633
A quick note for those who are not familiar with signed languages:

00:00:25.633 --> 00:00:28.800
DGS stands for <i>Deutsche Gebärdensprache</i>,

00:00:28.800 --> 00:00:30.633
meaning German Sign Language,

00:00:30.633 --> 00:00:33.200
which is a signed language that is primarily used

00:00:33.200 --> 00:00:35.200
in Germany and Luxembourg.

00:00:35.200 --> 00:00:37.333
Before we talk about the DGS Corpus,

00:00:37.333 --> 00:00:40.533
let's briefly discuss what the FAIR and CARE principles are.

00:00:42.866 --> 00:00:46.466
FAIR was introduced as a set of guiding principles for the creation of good data

00:00:46.466 --> 00:00:49.299
that is both human- and machine-readable.

00:00:49.299 --> 00:00:51.866
FAIR data should be <i>findable</i>,

00:00:51.866 --> 00:00:55.466
so you should use persistent unique identifiers like DOIs

00:00:55.466 --> 00:00:58.200
so that the data can always be found the same way.

00:00:58.200 --> 00:00:59.899
It should be <i>accessible</i>,

00:00:59.899 --> 00:01:03.799
meaning that there should be clearly defined ways to acquire the data.

00:01:03.799 --> 00:01:07.433
It should be <i>interoperable</i>, so instead of existing in a vacuum

00:01:07.433 --> 00:01:10.266
the data should indicate how it is interconnected

00:01:10.266 --> 00:01:13.766
and follow open standards that allow future connections;

00:01:13.766 --> 00:01:16.633
and most importantly, it should be <i>reusable</i>,

00:01:16.633 --> 00:01:18.766
allowing others to build on it.

00:01:18.766 --> 00:01:21.233
This includes providing strong documentation

00:01:21.233 --> 00:01:22.933
and a clear licence.

00:01:22.933 --> 00:01:25.599
FAIR encourages publishing data as open access,

00:01:25.599 --> 00:01:28.200
but it also acknowledges that sometimes

00:01:28.200 --> 00:01:31.500
there can be valid ethical reasons to restrict access.

00:01:32.133 --> 00:01:36.233
That brings us to the <i>CARE Principles for Indigenous Data Governance</i>.

00:01:37.000 --> 00:01:40.033
They were introduced as a response and complement to FAIR,

00:01:40.033 --> 00:01:42.299
providing ethical guidelines for the handling

00:01:42.299 --> 00:01:44.900
of data relating to Indigenous Peoples.

00:01:44.900 --> 00:01:47.700
While FAIR focusses on data publication,

00:01:47.700 --> 00:01:51.333
CARE is concerned with the entire data life cycle.

00:01:51.799 --> 00:01:55.766
CAREing data should exist in an ecosystem designed so that

00:01:55.766 --> 00:01:59.500
Indigenous Peoples can derive <i>collective benefit</i> from it,

00:01:59.500 --> 00:02:02.599
and it should empower their <i>authority to control</i> that data,

00:02:02.599 --> 00:02:05.233
acknowledging their rights and interests.

00:02:05.233 --> 00:02:08.466
Those working with Indigenous data have a <i>responsibility</i>

00:02:08.466 --> 00:02:12.599
to share how this use benefits and empowers its Peoples,

00:02:12.599 --> 00:02:15.633
and the <i>ethics</i> of using Indigenous data require

00:02:15.633 --> 00:02:18.266
that the rights and wellbeing of Indigenous Peoples

00:02:18.266 --> 00:02:22.533
are the primary concern at all stages of the data life cycle.

00:02:23.000 --> 00:02:26.866
While CARE is specifically designed with Indigenous Peoples in mind,

00:02:26.866 --> 00:02:29.833
it is also largely applicable to other minority populations,

00:02:29.833 --> 00:02:31.800
such as deaf communities,

00:02:31.800 --> 00:02:33.800
and it is in line with the ethical standards

00:02:33.800 --> 00:02:36.133
employed by our corpus project.

00:02:38.300 --> 00:02:40.500
Okay, so now that we have established

00:02:40.500 --> 00:02:42.733
what the FAIR and CARE principles are,

00:02:42.733 --> 00:02:44.766
let's talk about our dataset.

00:02:45.099 --> 00:02:49.366
The DGS Corpus is an annotated reference corpus of German Sign Language

00:02:49.366 --> 00:02:53.199
consisting of 560 hours of conversations.

00:02:53.599 --> 00:02:57.800
It was created as part of a 15-year project started in 2009

00:02:57.800 --> 00:03:00.833
and it includes both sign-by-sign annotations

00:03:00.833 --> 00:03:03.633
and translations into German and English.

00:03:03.933 --> 00:03:06.033
50 hours of that reference corpus

00:03:06.033 --> 00:03:09.599
are also publicly available as the Public DGS Corpus.

00:03:10.066 --> 00:03:12.699
If you are used to working only with spoken languages,

00:03:12.699 --> 00:03:14.966
50 hours or even 500

00:03:14.966 --> 00:03:17.333
will sound like a pretty small dataset to you,

00:03:17.333 --> 00:03:19.800
but this is in fact one of the largest corpora

00:03:19.800 --> 00:03:22.233
of a signed language out there.

00:03:22.966 --> 00:03:25.233
One reason for this is that signed languages

00:03:25.233 --> 00:03:27.333
have no commonly used written forms

00:03:27.333 --> 00:03:29.833
and annotating signed utterances properly

00:03:29.833 --> 00:03:31.933
is a very time-consuming process.

00:03:31.933 --> 00:03:33.800
One hour of recording needs about

00:03:33.800 --> 00:03:37.666
800 to 1000 hours of work before publication,

00:03:37.666 --> 00:03:40.966
setting hard constraints for how much of the reference corpus

00:03:40.966 --> 00:03:43.199
could be included in the public corpus.

00:03:43.199 --> 00:03:45.666
Publishing 50 hours of data allows us

00:03:45.666 --> 00:03:48.833
to present a cross-section of the overall corpus,

00:03:48.833 --> 00:03:52.433
giving a good impression of the
different kinds of contents that it covers.

00:03:54.199 --> 00:03:56.933
But let's take a step back and start at the beginning.

00:03:57.233 --> 00:04:00.233
The primary stakeholders in the DGS language community

00:04:00.233 --> 00:04:02.733
are members of the deaf community in Germany.

00:04:03.233 --> 00:04:06.733
Following the principle "nothing about us without us",

00:04:06.733 --> 00:04:09.833
the corpus project has always included deaf team members.

00:04:09.833 --> 00:04:13.000
In addition, a focus group of deaf users was formed

00:04:13.000 --> 00:04:14.766
to guide project decisions

00:04:14.766 --> 00:04:17.266
and assist us in connecting with the deaf community

00:04:17.266 --> 00:04:19.866
as well as keeping it informed about the project.

00:04:20.233 --> 00:04:22.199
To ensure collective benefit,

00:04:22.199 --> 00:04:26.333
the corpus was designed so that it would be both a source for linguistic research

00:04:26.333 --> 00:04:28.933
and a record of deaf culture,

00:04:28.933 --> 00:04:31.033
covering general life experience,

00:04:31.033 --> 00:04:33.266
deaf-specific experiences,

00:04:33.266 --> 00:04:35.166
perception of historical events,

00:04:35.166 --> 00:04:37.566
but also things like telling jokes.

00:04:37.566 --> 00:04:39.766
The goal was to create a resource

00:04:39.766 --> 00:04:42.466
that would be entertaining, informative,

00:04:42.466 --> 00:04:45.699
and that would support the identity of the community.

00:04:48.300 --> 00:04:52.600
The project recorded 330 participants from all across Germany

00:04:52.600 --> 00:04:55.933
whose primary language of daily life was DGS.

00:04:56.333 --> 00:04:59.066
Following the <i>authority to control</i> principle,

00:04:59.066 --> 00:05:02.399
informed consent was requested from all participants.

00:05:02.399 --> 00:05:05.866
This involved providing information in both DGS and German

00:05:05.866 --> 00:05:09.166
regarding the goals of the project, uses of the data,

00:05:09.166 --> 00:05:11.666
and what the rights of the participants are.

00:05:11.666 --> 00:05:16.000
These rights include restricting the purposes for which the data may be shared

00:05:16.000 --> 00:05:19.733
and also reviewing the recordings to give or withhold their approval

00:05:19.733 --> 00:05:23.699
for either entire recordings or individual moments.

00:05:25.033 --> 00:05:27.333
Let's fast forward a few years.

00:05:27.333 --> 00:05:30.699
After a lot of work annotating and translating recordings,

00:05:30.699 --> 00:05:35.399
the first full release of the public corpus was published in 2018.

00:05:35.399 --> 00:05:38.533
We also release updated versions on a regular basis

00:05:38.533 --> 00:05:42.466
to add more data, make corrections and react to feedback.

00:05:42.466 --> 00:05:44.966
As I mentioned before, the DGS Corpus is

00:05:44.966 --> 00:05:48.699
both a linguistic resource and a record of deaf culture.

00:05:48.699 --> 00:05:50.766
So to maximise its <i>collective benefit</i>

00:05:50.766 --> 00:05:54.266
we released its data on two separate portals.

00:05:55.833 --> 00:05:57.933
The first one is <i>"My DGS"</i>,

00:05:57.933 --> 00:05:59.699
a community portal for deaf people

00:05:59.699 --> 00:06:02.733
and others interested in DGS and deaf culture.

00:06:03.266 --> 00:06:06.500
It provides all recordings with optional German subtitles

00:06:06.500 --> 00:06:09.466
and its design focusses on making it easy to find

00:06:09.466 --> 00:06:11.733
and watch interesting content.

00:06:12.100 --> 00:06:13.833
Here is a little example.

00:06:27.233 --> 00:06:30.100
The second portal is <i>"My DGS – annotated"</i>,

00:06:30.100 --> 00:06:32.966
a research portal that provides the same recordings

00:06:32.966 --> 00:06:37.033
with full sign annotations and translations in German and English.

00:06:37.500 --> 00:06:39.399
All data is available to download,

00:06:39.399 --> 00:06:42.399
but can also be viewed in an online transcript viewer.

00:06:42.833 --> 00:06:45.966
Here is the video from before as seen through the research portal.

00:06:59.199 --> 00:07:01.899
The research portal also provides a type index

00:07:01.899 --> 00:07:05.033
of all unique signs occurring in the corpus.

00:07:09.899 --> 00:07:13.266
For each sign you receive
an overview of its corpus occurrences

00:07:13.266 --> 00:07:15.233
grouped by sign sense,

00:07:19.300 --> 00:07:21.566
and, where possible, a studio recording

00:07:21.566 --> 00:07:24.233
and phonetic transcription of its citation form

00:07:24.233 --> 00:07:27.166
as well as links to other lexical resources.

00:07:38.866 --> 00:07:40.899
The publication of the portals is also

00:07:40.899 --> 00:07:44.199
where FAIR joins CARE in our considerations.

00:07:44.199 --> 00:07:46.500
To make them reliably findable,

00:07:46.500 --> 00:07:50.133
each portal is treated as a separate but related dataset

00:07:50.133 --> 00:07:52.566
and given its own DOI.

00:07:52.566 --> 00:07:56.300
For a simpler dataset, a single DOI would be sufficient,

00:07:56.300 --> 00:07:59.800
but for a complex dataset like the Public DGS Corpus

00:07:59.800 --> 00:08:04.199
we find it advisable to also
have identifiers for individual parts.

00:08:04.199 --> 00:08:07.500
So we create DOIs for each individual transcript

00:08:07.500 --> 00:08:10.633
as well as for each type in the type index.

00:08:10.633 --> 00:08:13.533
That way researchers can clearly specify

00:08:13.533 --> 00:08:17.266
which transcripts or signs they refer to in their research.

00:08:17.266 --> 00:08:19.000
On top of that, whenever we release

00:08:19.000 --> 00:08:21.100
an updated version of the corpus

00:08:21.100 --> 00:08:25.300
every component that has changed
also receives a new DOI.

00:08:25.300 --> 00:08:26.733
So it is always clear

00:08:26.733 --> 00:08:29.333
which version of the corpus is being referred to.

00:08:29.666 --> 00:08:32.733
All of these DOIs are then given qualified references

00:08:32.733 --> 00:08:35.933
to clarify how they are related to each other.

00:08:38.133 --> 00:08:41.666
Each DOI also comes with a set of machine-readable metadata,

00:08:41.666 --> 00:08:43.899
covering general dataset information

00:08:43.899 --> 00:08:46.566
like its name, authors, release date

00:08:46.566 --> 00:08:49.633
and all those qualified references I just mentioned.

00:08:49.633 --> 00:08:52.533
For information that is more specific to language data

00:08:52.533 --> 00:08:55.700
we also provide a CMDI file for each transcript.

00:08:55.700 --> 00:09:00.100
In there we specify metadata about the participants, elicitation tasks,

00:09:00.100 --> 00:09:03.700
the languages of the primary and secondary data and so on.

00:09:05.733 --> 00:09:09.866
Apart from this metadata, the corpus has a wealth of documentation.

00:09:09.866 --> 00:09:11.833
In addition to peer-reviewed articles,

00:09:11.833 --> 00:09:13.666
there are over twenty project notes

00:09:13.666 --> 00:09:16.133
documenting various aspects of the project.

00:09:21.033 --> 00:09:22.799
This includes a Data Statement,

00:09:22.799 --> 00:09:25.000
a document type specifically designed

00:09:25.000 --> 00:09:27.899
to help researchers understand the background of a dataset

00:09:27.899 --> 00:09:30.799
and anticipate its inherent biases.

00:09:31.166 --> 00:09:35.366
Of course, all project notes have their own version-controlled DOIs.

00:09:36.833 --> 00:09:39.666
And that brings me to the end of my presentation.

00:09:39.666 --> 00:09:41.666
Of course, I only skimmed the surface

00:09:41.666 --> 00:09:43.500
and had to skip a number of aspects,

00:09:43.500 --> 00:09:46.500
such as usage licence, anonymisation,

00:09:46.500 --> 00:09:49.033
or the archiving of the original recordings,

00:09:49.033 --> 00:09:51.666
all of which do play a role in the CAREful

00:09:51.666 --> 00:09:54.299
and FAIR handling of the DGS Corpus.

00:09:56.566 --> 00:09:58.466
So if you would like to learn more,

00:09:58.466 --> 00:09:59.899
please read our paper!

00:09:59.899 --> 00:10:01.733
And if you are attending LREC in person,

00:10:01.733 --> 00:10:03.133
come by our poster.

00:10:03.133 --> 00:10:05.799
And, of course, go and check out the corpus itself

00:10:05.799 --> 00:10:07.700
and the many stories it contains.

00:10:07.933 --> 00:10:09.733
Thank you!
