DOI: 10.25592/dgs.corpus-4.0

Welcome to the Public DGS Corpus Release 4!

In this portal you will find 50 hours of video materials from the DGS-Korpus project which have been made available along with annotations for research purposes.

If you want to download materials, please pay attention to the license conditions.

Experts in corpus query languages may want to check our MEINE DGS – ANNIS portal featuring almost the same dataset as this site, but allowing more complex searches than possible here.

By clicking Types, you can view the list of all sign types occurring in the public corpus. Click on one of these types to see all corresponding tokens in the public corpus. Clicking on a token reference then takes you directly to the occurrence in the transcript.

By clicking Formats, you get a list of all elicitation formats used, each with the number of transcripts for this format in the Public DGS Corpus. Click on one format to see more details on the format as well as the list of all transcripts, sorted by regions and topics.

For all transcripts, keywords have been assigned in order to provide a rough content-related access to the data. By clicking Keywords, you can view an index of all keywords and find the transcripts they have been assigned to.

Background Information

Data Elicitation Formats

A set of 20 different tasks was used for the participants. The formats ranged from story-retelling (with prompts in sign, picture, or movie) to discussions on a given topic as well as free conversations. With careful planning, it was possible to make the mix of formats varied enough that most participants enjoyed the recording session despite a net length of 5 hours.

The set contains a number of tasks previously used in other corpus projects, both on spoken and sign languages, to lay a basis for cross-linguistic research, as well as new formats. Not all details of the newly developed elicitation materials are available in the publications in order to keep the material suitable for future data collections. The materials are, however, available to other researchers upon request.

Data Collection Regions

Drawing on experience from earlier projects, one of the key decisions was to use a mobile studio that could be set up in different places across Germany. The idea was to create as much of a “local” atmosphere as possible, with the recording location in the region and all persons involved coming from that region, while still ensuring the high-quality recordings needed for transcription. Obviously, the number of recording locations has to be a compromise between localness in the above sense and the participants’ travel times on the one hand, and logistics on the other.

Our solution was to define thirteen data collection regions. In doing so, we tried to respect the catchment areas of current and former Schools for the Deaf; state (Bundesland) borders, which determine educational settings among other things, and especially the former border between West and East Germany; suspected dialectal borders; and also practical considerations such as travel times to the recording locations. The regions were further subdivided into up to five sub-regions relevant for participant selection. Large metropolitan areas form their own sub-regions, in contrast to others with mixed or more rural structures.

Below you find a map of Germany showing the data collection regions. For comparison, there is a map of Germany showing the states (Bundesländer).

Data collection regions:

  • ber: Berlin (Berlin, Brandenburg, partially Saxony-Anhalt) 6.18 M
  • fra: Frankfurt (South Hesse, Saarland, partially Rhineland-Palatinate) 8.69 M
  • goe: Göttingen (Hannover, South Lower Saxony, North Hesse) 5.53 M
  • hb: Bremen (Bremen, North-West Lower Saxony) 3.28 M
  • hh: Hamburg (Hamburg, North Lower Saxony) 2.82 M
  • koe: Cologne (North Rhine, partially Rhineland-Palatinate) 10.84 M
  • lei: Leipzig (Saxony, Thuringia, partially Saxony-Anhalt) 8.72 M
  • mst: Münster (Westphalia, Osnabrück, County of Bentheim) 9.08 M
  • mue: Munich (South Bavaria) 7.26 M
  • mvp: Rostock (Mecklenburg-Vorpommern) 1.69 M
  • nue: Nuremberg (North Bavaria) 5.23 M
  • sh: Schleswig-Holstein 2.83 M
  • stu: Stuttgart (Baden-Württemberg) 10.74 M

States (Bundesländer):

  • Schleswig-Holstein 2.83 M
  • Hamburg 1.75 M
  • Lower Saxony 8.62 M
  • Bremen 0.66 M
  • North Rhine-Westphalia 18.03 M
  • Hesse 6.08 M
  • Rhineland-Palatinate 4.05 M
  • Baden-Württemberg 10.74 M
  • Bavaria 12.50 M
  • Saarland 1.04 M
  • Berlin 3.40 M
  • Brandenburg 2.54 M
  • Mecklenburg-West Pomerania 1.69 M
  • Saxony 4.22 M
  • Saxony-Anhalt 2.43 M
  • Thuringia 2.31 M

Participants

Due to the lack of census data on the Deaf population, the target number of participants per region was based on the population figures of the general population, with a weight of 2 for larger cities to reflect the common (though unconfirmed) observation that Deaf people often prefer to live in larger cities. Together with a set minimum of 16 participants per region (to cover four age groups times two sexes with at least two participants each), this resulted in a target number of 328 participants. We actually filmed 330 participants.

In the map below, you find the number of participants from each region, detailed by age group.

Participants per region, by age group (18-30 + 31-45 + 46-60 + 61+ = total):

  • ber: Berlin 8+9+7+4=28
  • fra: Frankfurt 8+10+7+7=32
  • goe: Göttingen 4+5+6+5=20
  • hb: Bremen 3+6+4+3=16
  • hh: Hamburg 1+6+5+4=16
  • koe: Cologne 12+11+11+10=44
  • lei: Leipzig 7+7+7+9=30
  • mst: Münster 7+9+8+8=32
  • mue: Munich 8+5+7+6=26
  • mvp: Rostock 4+4+4+4=16
  • nue: Nuremberg 6+5+3+4=18
  • sh: Schleswig-Holstein 4+2+7+3=16
  • stu: Stuttgart 9+12+7+8=36

Across regions, the participants are fairly well balanced with respect to age group, and perfectly balanced with respect to gender.

Participants by gender and age group (18-30 + 31-45 + 46-60 + 61+ = total):

  • male: 40+45+42+38=165
  • female: 41+46+41+37=165
  • total: 81+91+83+75=330
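The figures above can be cross-checked with a few lines of Python. The sketch below copies the per-region counts from this page and recomputes the totals per region and per age group; the variable and function names are our own choice for illustration and are not part of any DGS-Korpus tool.

```python
# Participants per region and age group (18-30, 31-45, 46-60, 61+),
# copied from the per-region figures on this page.
PARTICIPANTS = {
    "ber": (8, 9, 7, 4),
    "fra": (8, 10, 7, 7),
    "goe": (4, 5, 6, 5),
    "hb": (3, 6, 4, 3),
    "hh": (1, 6, 5, 4),
    "koe": (12, 11, 11, 10),
    "lei": (7, 7, 7, 9),
    "mst": (7, 9, 8, 8),
    "mue": (8, 5, 7, 6),
    "mvp": (4, 4, 4, 4),
    "nue": (6, 5, 3, 4),
    "sh": (4, 2, 7, 3),
    "stu": (9, 12, 7, 8),
}

# Set minimum per region: four age groups x two sexes x two participants each.
MIN_PER_REGION = 4 * 2 * 2


def region_totals(data):
    """Sum the four age-group counts for each region."""
    return {region: sum(counts) for region, counts in data.items()}


def age_group_totals(data):
    """Sum each age-group column across all regions."""
    return [sum(column) for column in zip(*data.values())]


totals = region_totals(PARTICIPANTS)
by_age = age_group_totals(PARTICIPANTS)

print(by_age, sum(by_age))  # [81, 91, 83, 75] 330
print(all(t >= MIN_PER_REGION for t in totals.values()))  # True
```

The column sums agree with the "total" row of the gender table, and every region reaches the set minimum of 16 participants.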

Annotation Conventions applied in the DGS-Korpus Public Data

The annotation conventions are described in the project note AP03-2018-01.

File Formats available for Download

If you use iLex, please download the iLex file and import it into your iLex database. You may want to download the A, B and C camera perspectives as well in order to have them available locally. This is not strictly necessary as the iLex import file provides URLs to access the files via HTTPS. In addition to the annotation, the iLex files contain metadata on the session as well as the participants.

If you use ELAN, please download the ELAN file and optionally the A, B and C movies, then open the ELAN file. Downloading the movie files is not strictly necessary as the ELAN file contains URLs to access the files via HTTPS. However, local copies may turn out to be useful for performance reasons when working in ELAN.

For other tools such as MaxQDA, it is often possible to import SRT (subtitle) files. Please note that the files linked differ between the English and the German pages. If the tool you are using can handle multiple video track files, download the A, B and C files. If the tool only accepts one file, you may want to use the AB movie file which is a side-by-side of the B and A perspective.

We make the pose analyses of the A and B camera perspectives and the corresponding side views available in different flavours: OpenPose, MediaPipe, AppleVision and Surrey 3D liftings. A download file contains the data for all four perspectives plus information on the spatial resolution of the input file (which in some cases differs from the resolution of the files offered here for download). Where the video is anonymised, the pose data contains empty coordinate arrays. For size reasons, the files are zipped. For details on how the pose data were processed, please see project note AP06-2019-01.

Finally, you can download a CMDI file containing metadata for the session as well as the participants.

Where studio recordings are available (see above), they are not only displayed on the detail pages reachable from the Types list, but are also available for download, accompanied by corresponding iLex, ELAN, and SRT files as well as pose data. Studio recordings are typically available from four perspectives: frontal, 45° to the side, 90° to the side, and from above. The pose data correspond to the first three of those perspectives. Please note that the studio recordings show the actors standing while the participants in the corpus recordings are seated.

Further Information

Our project note AP06-2020-01: Data Statement describes the Corpus in more detail and provides references to detailed documentation. For a list of all our project notes, please click here.

For more detail specifically on the data collection formats, please consult the following publications:

  • Nishio, Rie / Hong, Sung-Eun / König, Susanne / Konrad, Reiner / Langer, Gabriele / Hanke, Thomas / Rathmann, Christian (2010). “Elicitation methods in the DGS (German Sign Language) Corpus Project”. Poster presented at the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, following the 2010 LREC Conference in Malta, May 22-23, 2010. Paris: ELRA, pp. 178-185. [Paper & Poster]
  • Hanke, Thomas / Hong, Sung-Eun / König, Susanne / Langer, Gabriele / Nishio, Rie / Rathmann, Christian (2010). “Designing Elicitation Stimuli and Tasks for the DGS Corpus Project”. Poster presented at the Theoretical Issues in Sign Language Research Conference (TISLR 10), Sept 30 – Oct 2, 2010 at Purdue University, Indiana, USA. [Poster]

How to Cite

We ask you to cite corresponding DGS-Korpus publications if you publish your research based on this material.

If you want to cite the dataset itself, please find the citation data here. In order to cite individual transcripts or type data or data elicitation formats, please use the DOIs shown on the respective web pages. By clicking on any DOI, you not only get a list of all versions of that transcript or type or format already published, but also find a version-independent DOI always referring to the latest version published of that transcript or type or data elicitation format.