Digitisation of the TIGR participant questionnaires
ShareTIGR
One of the first steps in the ShareTIGR project has been to digitise 115 questionnaires containing data about the participants in the TIGR recordings, specifically their age, sex, place of primary school, place of residence, place of work in Switzerland, language skills, profession, and educational qualifications. The questionnaires do not contain any personal names, addresses or dates of birth, and they refer to the recordings by means of informant identifiers and event identifiers. They had been filled in on paper right before the recordings were made, so the first step of digitisation was to transfer the data manually into an Excel table, a structured digital format that will facilitate further processing. While transferring the data, we made some decisions and adjustments that we reflect upon in this blog post.
Let's start with two basic categories that future users of the TIGR corpus may want to be informed about, since they are relevant for interpreting the recorded discourse at multiple levels: età ('age') and sesso ('sex').
The age of interlocutors is important metadata as it may influence the language variety spoken, the content and progression of the talk as well as the demeanour and participant role of a speaker. It is also relevant for quantitative research questions. In the questionnaires, the InfinIta project team asked for the age in years at the time of the recording. However, this format will probably not be seen in the published FAIR version of the corpus, as it is safer to give an age range rather than a specific number for data protection reasons.
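The conversion from an exact age to an age range can be sketched in a few lines of Python. This is only an illustration: the function name age_to_range and the ten-year bin width are assumptions, since the width of the ranges to be published is not specified here.

```python
def age_to_range(age: int, width: int = 10) -> str:
    """Map an exact age in years to a range label such as '30-39'.

    The bin width is a hypothetical default, not a project decision.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if width < 1:
        raise ValueError("width must be at least 1")
    lower = (age // width) * width  # lower bound of the bin containing the age
    return f"{lower}-{lower + width - 1}"


print(age_to_range(34))
print(age_to_range(7, width=5))
```

Wider bins protect participants better but make age-related quantitative analyses coarser, so the choice of width is a trade-off rather than a technical detail.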
When asking the participants to indicate their sex, the InfinIta team offered the options “F” for female / femminile and “M” for male / maschile. While the majority of participants answered the question straightforwardly by ticking one of the two boxes, two people added a third box, “altro” (other), by hand. Interestingly, neither of them actually ticked the option they had added. This suggests that they did not need a third option to categorise themselves but used it to criticise the binary system we employed. Engaging with the questionnaire in a way that was not explicitly requested is a form of negotiating normative social categories at a broader societal level. Notably, “sex” was the only category that received manual corrections: the participants thereby pointed to, and repaired, the set of social categories that was relevant to them. Since no one ticked the third option, the additions had no technical or conceptual impact on our work, so they were not included in the Excel table (and hence in the metadata of the corpus). However, to do justice to their social relevance, they could be mentioned in the corpus description.
The questionnaire further requested information about places, in particular the municipality of the participants’ primary school and the municipalities of their current residence and work or study. Such geographical information may be relevant for interpreting the corpus data because language can be expected to vary depending on where people were brought up and where they live. The information given by the participants amounted to only a few words in the paper questionnaire but was expanded to multiple columns in the Excel table. We decided to enrich the place names with their province or canton, their region and their country so that they can be filtered and grouped according to various factors. The result is three columns for Swiss locations (place, canton, country) and four for Italian locations (place, province, region, country). The Excel sheet shows a relatively high number of primary school and residence locations in Italy, which is due to the fact that many TIGR participants are either cross-border commuters or Swiss residents of Italian origin.
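The expansion of a single place name into three or four columns can be pictured as a lookup in a small gazetteer. The sketch below is an assumption about how such a step could be automated, not a description of how the table was actually filled in; the names Place, GAZETTEER and enrich are hypothetical, and the two entries are illustrative rather than taken from the questionnaires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    place: str
    subdivision: str        # canton (Switzerland) or province (Italy)
    region: Optional[str]   # only filled in for Italian locations
    country: str


# Illustrative entries only; a real gazetteer would cover every place named.
GAZETTEER = {
    "Lugano": Place("Lugano", "Ticino", None, "Switzerland"),
    "Como": Place("Como", "Como", "Lombardia", "Italy"),
}


def enrich(place_name: str) -> dict:
    """Return the columns for one place: three for Swiss, four for Italian."""
    entry = GAZETTEER[place_name.strip()]
    if entry.country == "Switzerland":
        return {"place": entry.place, "canton": entry.subdivision,
                "country": entry.country}
    return {"place": entry.place, "province": entry.subdivision,
            "region": entry.region, "country": entry.country}


print(enrich("Lugano"))
print(enrich("Como"))
```

Keeping canton, province, region and country in separate columns is what makes it possible to filter and group participants at different geographical levels.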
Another item of the questionnaire concerned language skills. Like other items, it had been phrased quite succinctly (lingue conosciute 'languages known') and gave the participants some liberty as to possible responses. Some informants who had declared themselves fluent Italian speakers before agreeing to participate in the research did not mention Italian in the questionnaire, which shows that they understood the prompt as referring to foreign language skills only. In these cases, we added Italian to the list of known languages in the table, since this information is essential metadata for TIGR as a corpus of spoken Italian. Further variation arose around the interpretation of the category lingue: most informants named standard varieties, while some included both standard languages and regional varieties (dialetto ticinese, Svizzero tedesco) or local varieties (dialetto grosino). In addition, a small number of participants used parentheses or an explicit mention of comprehension skills to indicate less than full competence in some variety. In all these cases, we kept the original statements at this stage of data processing.
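The rule applied to the language column (add Italian where it is missing, leave everything else as stated) could be expressed as follows. This is a minimal sketch under assumptions: the function name with_italian and the label "italiano" are hypothetical, and the check only recognises an entry that consists of that single word.

```python
def with_italian(languages: list[str]) -> list[str]:
    """Return the stated languages unchanged, adding 'italiano' if absent.

    Original wording (dialects, parentheses, remarks on comprehension)
    is deliberately not normalised.
    """
    stated = [entry.strip() for entry in languages if entry.strip()]
    if any(entry.casefold() == "italiano" for entry in stated):
        return stated
    return ["italiano"] + stated


print(with_italian(["tedesco", "dialetto ticinese"]))
print(with_italian(["Italiano", "francese"]))
```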
Next came the answers about profession, which showed some variation in the case of apprentice and student participants. Some, but not all, indicated their branch of study or professional field; a few students mentioned a part-time job as their profession. To increase uniformity, and in the interest of data protection, we decided to keep only the most generic information about the type of training (apprendista, studente/studentessa), which was given by virtually all student participants, and to ignore any additional information provided only by some (apprendista muratore).
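Reducing such answers to the generic type of training amounts to keeping the first word when it is one of the generic labels. The sketch below illustrates this idea only; the function name generalise_profession is hypothetical, "infermiere" is an invented example answer, and cases such as a student who listed only a part-time job would still need a manual decision.

```python
GENERIC_TRAINING_LABELS = ("apprendista", "studente", "studentessa")


def generalise_profession(answer: str) -> str:
    """Keep only the generic training label if the answer starts with one."""
    cleaned = answer.strip()
    words = cleaned.split()
    if words and words[0].casefold() in GENERIC_TRAINING_LABELS:
        return words[0].casefold()
    return cleaned  # other professions are left as stated


print(generalise_profession("apprendista muratore"))
print(generalise_profession("infermiere"))
```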
The last column of the questionnaire was istruzione (education). It offered some options to be ticked (scuole elementari, licenza di scuola media, formazione professionale, diploma di scuola media superiore, laurea triennale, laurea magistrale, dottorato) and a free text field (altro). Since the one-word prompt did not specify whether to indicate all qualifications or only the highest, we obtained both types of answers. As a general rule, we retained only the highest educational qualification in the table. In the case of a skilled trade, however, people may have completed vocational training, a high school diploma or both, so we decided to list both qualifications when both were declared. Also, some participants obtained their educational qualifications before the so-called Bologna reform of higher education in Europe was completed and, accordingly, provided answers under altro such as laurea vecchio ordinamento, which is formally equivalent to today's laurea magistrale. We decided to keep their original answers to avoid anachronistic categories. On the other hand, in a few cases someone provided an answer under altro for which our scheme already offered a matching option that had been overlooked or misunderstood (e.g., istituto tecnico instead of diploma di scuola media superiore). In these cases, we changed the original answers to make them fit our scheme.
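The general rule and its exception can be summarised in a short Python sketch. Several points are assumptions made for illustration: the order of the tick-box options is treated as a ranking from lowest to highest, both vocational training and the high school diploma are kept only when nothing higher was ticked, and free-text answers under altro are left out because they were handled case by case. The function name retained_qualifications is hypothetical.

```python
# Tick-box options in questionnaire order, treated here as a ranking (assumption).
SCALE = [
    "scuole elementari",
    "licenza di scuola media",
    "formazione professionale",
    "diploma di scuola media superiore",
    "laurea triennale",
    "laurea magistrale",
    "dottorato",
]
RANK = {label: position for position, label in enumerate(SCALE)}
VOCATIONAL = "formazione professionale"
DIPLOMA = "diploma di scuola media superiore"


def retained_qualifications(ticked: list[str]) -> list[str]:
    """Keep the highest ticked qualification, or both vocational training
    and the high school diploma when these two are the highest declared."""
    known = sorted((q for q in ticked if q in RANK), key=RANK.__getitem__)
    if not known:
        return []
    highest = known[-1]
    if highest == DIPLOMA and VOCATIONAL in known:
        return [VOCATIONAL, DIPLOMA]
    return [highest]


print(retained_qualifications(["licenza di scuola media", "laurea triennale"]))
print(retained_qualifications([DIPLOMA, VOCATIONAL]))
```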
The supposedly simple task of digitising paper questionnaires confronted us with a number of unexpected issues. None of them was serious or difficult to overcome, but they showed that even “simple” tasks in the endeavour of making spoken language data and metadata FAIR require careful, forward-looking consideration.
Nina Profazi & Johanna Miecznikowski