The KIParla corpus
The KIParla corpus is a new resource for the study of spoken Italian and is the result of a collaboration between the Universities of Bologna and Turin.
The corpus has several innovative features such as:
- Access to a large amount of metadata about the speakers and the contexts in which the recorded interactions took place;
- The ability to consult the corpus online and have access to the entire transcript of each conversation;
- The alignment of the transcript with the audio track.
Corpus design
Geographical differentiation is preeminent in characterizing sociolinguistic variation in Italian; indeed, even in the most controlled productions of educated speakers it is possible to detect the presence of regional traits.
In the KIParla corpus, linguistic data were initially collected in the cities of Bologna and Turin; the sociolinguistic situation of both survey points is characterized by the coexistence of Italian and dialect. In addition, although to very different degrees, both cities have long been destinations of internal mobility as well as of external migration flows; therefore, several regional varieties of Italian and Italo-Romance dialects can be found there, as well as languages of recent immigration. For this reason, in addition to information on where each recording was made, data on the geographic origin of individual speakers are also accessible.
With the addition of the KIPasti module, recordings collected across all geographic areas of Italy were integrated into the corpus.
The speakers involved in the recordings are differentiated primarily by age, educational qualification and occupation, which are particularly significant parameters in determining the social position of individuals.
In the corpus, there are various types of interactions (such as semistructured interviews, table conversations and, in the university context, lectures and exams), differentiated according to situational parameters: symmetrical/asymmetrical relationship between participants, presence/absence of a predefined topic, presence/absence of norms for turn-taking, etc.
Corpus construction: data collection, transcription and accessibility
All data were recorded with an overt microphone, and all speakers signed an informed consent form (drafted in compliance with current European data protection regulations; see GDPR) authorizing:
- The collection of data;
- The storage of data on hardware located in European countries and/or on cloud services provided by universities;
- The publication of data online to carry out scientific research.
Before being uploaded online, the data (both audio files and transcripts) were anonymized; the only sensitive data that remains, accessible upon registration, is each speaker's own voice. In the transcripts, sensitive information has been replaced; in the audio files, it has been masked.
The recordings were transcribed using ELAN software, which allows alignment of the transcript with the audio track.
For transcripts, a simplified version of the Jefferson system (see Tab. 1), frequently used in conversation analysis, was adopted.
Symbol | Meaning |
, | Rising intonation |
. | Falling intonation |
: | Prolonged sound |
(.) | Short pause |
>ciao< | Faster pronunciation |
<ciao> | Slower pronunciation |
[ciao] | Overlap between speakers |
(ciao) | Text difficult to understand (transcriber's hypothesis) |
xxx | Unintelligible text |
((laughs)) | Nonverbal behavior |
= | Prosodically joined units |
Table 1. Symbols for transcription
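To illustrate how the symbols in Table 1 relate to plain spelling, the following is a minimal sketch, not the project's own code, that strips the transcription marks from a unit and keeps only the words. The function name `to_plain_orthography` and the choice to drop unintelligible stretches (`xxx`) and nonverbal notes entirely are assumptions made for this example.

```python
import re

def to_plain_orthography(jefferson: str) -> str:
    """Strip the Table 1 symbols from a transcribed unit, keeping only the words.

    Hypothetical helper: the rules below are an assumption, not the corpus pipeline.
    """
    text = re.sub(r"\(\([^)]*\)\)", " ", jefferson)  # ((laughs)): nonverbal behavior
    text = re.sub(r"\(\.\)", " ", text)              # (.): short pause
    text = re.sub(r"\bxxx\b", " ", text)             # xxx: unintelligible text
    text = re.sub(r"[\[\]()<>=]", " ", text)         # overlap, hypothesis, pace, joining marks
    text = re.sub(r"[,.:]", "", text)                # intonation and prolongation marks
    return " ".join(text.split())                    # collapse the leftover whitespace


if __name__ == "__main__":
    print(to_plain_orthography("[ciao] (.) >come stai< ((laughs)) be:ne,"))
```

With these rules, the sample unit above is reduced to `ciao come stai bene`. A real pipeline would probably keep the removed information as annotation rather than discard it.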
In order to make the entire corpus searchable through NoSketch Engine, a Python script was developed that makes it possible to:
- Use metadata both as search filters and as information about individual recordings;
- Run searches on both the plain orthographic form and the Jefferson transcription;
- Link each occurrence with the intonational unit it is in;
- Consult each form separately.
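The capabilities listed above map naturally onto the "vertical" text format that NoSketch Engine indexes: one token per line with tab-separated attributes, wrapped in XML-like structure tags that carry metadata. The sketch below is an assumption about how such a conversion could look, not the script developed for the corpus; the function name `to_vertical`, the structure names `doc` and `unit`, and the metadata keys are all hypothetical.

```python
from xml.sax.saxutils import quoteattr

def to_vertical(conversation_id, metadata, units):
    """Render one conversation in a vertical-style layout.

    metadata: dict of conversation-level attributes (usable as search filters).
    units:    list of (unit_id, [(plain_form, jefferson_form), ...]) pairs,
              one entry per intonational unit.
    All structure and attribute names here are illustrative assumptions.
    """
    attrs = " ".join(f"{key}={quoteattr(str(value))}" for key, value in metadata.items())
    lines = [f"<doc id={quoteattr(conversation_id)} {attrs}>"]
    for unit_id, tokens in units:
        # Each token sits inside the intonational unit it belongs to.
        lines.append(f"<unit id={quoteattr(unit_id)}>")
        for plain, jefferson in tokens:
            # Two columns: plain spelling and Jefferson form, searchable separately.
            lines.append(f"{plain}\t{jefferson}")
        lines.append("</unit>")
    lines.append("</doc>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_vertical(
        "conv01",
        {"city": "Bologna", "type": "interview"},
        [("u1", [("ciao", "cia:o,"), ("bene", "bene.")])],
    ))
```

Keeping the plain and Jefferson forms as separate columns is what would allow a query to target either one, while the enclosing tags let metadata act as filters.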
Incremental modularity
A key feature that makes the KIParla corpus particularly innovative is its incremental modularity, that is, its internal organization into independent modules and the possibility of adding new modules over time.
Modules are different corpora of spoken Italian that share the same design and a common set of metadata, are transcribed with ELAN and are made available through NoSketch Engine. The modules can focus on different dimensions of linguistic variation and can collect data from different geographic areas. However, the shared data collection and processing procedure ensures a high level of mutual comparability.
The full accessibility of metadata makes the corpus easily expandable, through the addition of further modules focusing on different geographic, socio-cultural or communicative aspects, and updatable, through the addition of new data to existing modules. The very nature of the KIParla corpus makes it a potential monitor corpus, open to additions and updates over time.
To date, the KIParla corpus consists of three modules:
The broader the spectrum of interactions collected and the more socio-geographically differentiated the sample of speakers involved, the more representative the corpus will be of the languages and language varieties spoken in Italy.
We envision the KIParla corpus growing over time in two main directions. On the one hand, we aim to collaborate with existing projects in order to see whether ready-made data collected for different purposes can be adapted to form new modules of the KIParla corpus. The only requirement in these cases is that (at least) a core set of metadata be traceable and accessible, both for speakers (gender, age, geographic origin, education level and profession) and for the interaction (interview, free conversation, etc.). On the other hand, we would like to initiate new data collections in different regions.
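As an illustration of the core-metadata requirement, the following sketch checks whether a candidate dataset provides the speaker and interaction fields named above. It is only an example under stated assumptions: the field names, the dictionary layout and the function `missing_core_metadata` are hypothetical and do not reflect the corpus's actual metadata schema.

```python
# Hypothetical field names for the core metadata set described in the text.
SPEAKER_FIELDS = ("gender", "age", "geographic_origin", "education_level", "profession")
INTERACTION_FIELDS = ("interaction_type",)

def missing_core_metadata(speakers, interaction):
    """Return a list of core metadata fields that are absent or empty.

    speakers:    list of dicts, one per speaker.
    interaction: dict describing the recorded interaction.
    """
    missing = []
    for index, speaker in enumerate(speakers):
        for field in SPEAKER_FIELDS:
            if speaker.get(field) in (None, ""):
                missing.append(f"speaker[{index}].{field}")
    for field in INTERACTION_FIELDS:
        if interaction.get(field) in (None, ""):
            missing.append(f"interaction.{field}")
    return missing


if __name__ == "__main__":
    speakers = [
        {"gender": "F", "age": 30, "geographic_origin": "Torino",
         "education_level": "degree", "profession": "teacher"},
        {"gender": "M"},
    ]
    print(missing_core_metadata(speakers, {"interaction_type": "interview"}))
```

An empty result would indicate that, on these assumed fields, the dataset meets the minimum requirement for being considered as a new module.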
In the future, we also plan two annotation steps, namely lemmatization and POS tagging.
English Version
You can find an extended English description here.
Reference:
Mauri, Caterina, Silvia Ballarè, Eugenio Goria, Massimo Cerruti & Francesco Suriano (2019). "KIParla corpus: a new resource for spoken Italian." In: Bernardi, Raffaella, Roberto Navigli & Giovanni Semeraro (eds.), Proceedings of the 6th Italian Conference on Computational Linguistics CLiC-it.