In addition to the recordings, transcriptions form another important part of FUSE. The mark-up guidelines for the transcriptions follow those used in the SCOTS corpus but include some additional elements. The general mark-up guidelines from SCOTS are given in the block quotation below, followed by the additional tags and their descriptions.
Stretches where more than one participant is speaking at the same time are marked in the transcription by double slashes ( // ) surrounding the words which overlap:
Speaker 1: …although it might come across //as being arrogant.//
Speaker 2: //[laugh]//
Here Speaker 2’s laugh overlaps with the final three words of Speaker 1’s utterance.
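As an illustration of how the overlap mark-up can be processed, the following minimal Python sketch pulls out the stretches enclosed in double slashes from a single transcription line. The function name overlap_spans is hypothetical and not part of SCOTS or FUSE, and the sketch assumes that double slashes occur in a line only as overlap markers (a URL containing // would be misread).

```python
import re

# Matches the text between a pair of double slashes, e.g. //as being arrogant.//
OVERLAP = re.compile(r"//(.*?)//")


def overlap_spans(line: str) -> list[str]:
    """Return the overlapping stretches marked in one transcription line."""
    return [span.strip() for span in OVERLAP.findall(line)]


speaker_1 = "…although it might come across //as being arrogant.//"
speaker_2 = "//[laugh]//"
print(overlap_spans(speaker_1))
print(overlap_spans(speaker_2))
```

Pairing the spans of two consecutive speaker lines in order would then show which stretches were spoken at the same time.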
Transcribers have used the following tags:
Sometimes words or sequences of words have been censored from documents, principally so that individuals may not be identified. Where this has been done, a Censored tag indicates what has been removed as follows:
“Don’t put your fingers in it though, [CENSORED: forename]. Cause you’ll be a mucky pup.”
Other items which may be censored include postal addresses, email addresses, place names, phone numbers, and company names.
Where this applies to an audio transcription, the corresponding section of the audio file has been replaced by a beep (or, in certain circumstances, silence). Censoring of personal or sensitive information has occasionally been necessary in written documents too, and is marked in the same way as above.
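The Censored tag has a regular shape, so the category of each removed item can be recovered mechanically. The sketch below is only an illustration under that assumption; the helper name censored_categories is hypothetical.

```python
import re

# Matches tags such as [CENSORED: forename] and captures the category label.
CENSORED = re.compile(r"\[CENSORED:\s*([^\]]+?)\s*\]")


def censored_categories(text: str) -> list[str]:
    """Return the category of every censored item, in order of appearance."""
    return CENSORED.findall(text)


line = "Don't put your fingers in it though, [CENSORED: forename]. Cause you'll be a mucky pup."
print(censored_categories(line))
```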
Words or longer stretches which the transcriber has not been able to hear or understand appear as follows:
“Yeah, what kind of cup was this [inaudible]?”
Parts of the transcription where the transcriber and checker are unsure are surrounded by question marks: [?]…[/?]
“they’re chaffin away [?]crattlin[/?] these toy cups”
Words marked as unclear are not indexed, and do not contribute to the word count.
False starts and truncation
False starts, stammering and truncated words are tagged and appear in the transcription followed by a hyphen:
“nineteen f-f-f- fifty-nine Triumph.”
“Everybody got a Chri- the whole class got a Christmas present.”
These are not indexed, and do not contribute to the word count.
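The rules above for unclear words and false starts can be approximated in a short Python sketch. It is not the procedure used by SCOTS or FUSE; it assumes that everything inside [?]…[/?] and every token ending in a hyphen is excluded, and that other bracketed tags such as [inaudible] are not words. The name countable_words is hypothetical.

```python
import re

UNCLEAR = re.compile(r"\[\?\].*?\[/\?\]")  # [?]...[/?] spans
BRACKETED = re.compile(r"\[[^\]]*\]")      # [inaudible], [laugh], ...
PUNCTUATION = ".,?!;:\"“”‘’'…"


def countable_words(utterance: str) -> list[str]:
    """Return the words that would contribute to the word count."""
    # Remove unclear spans first, so the word inside them is dropped as well.
    text = UNCLEAR.sub(" ", utterance)
    text = BRACKETED.sub(" ", text)
    words = []
    for token in text.split():
        token = token.strip(PUNCTUATION)
        # Skip bare punctuation and false starts / truncations (trailing hyphen).
        if not token or token.endswith("-"):
            continue
        words.append(token)
    return words


print(countable_words("nineteen f-f-f- fifty-nine Triumph."))
print(len(countable_words("they're chaffin away [?]crattlin[/?] these toy cups")))
```

Note that a hyphenated word such as fifty-nine is kept, because only a trailing hyphen is treated as a truncation marker.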
Semi-lexical items (‘mmhm’, ‘erm’, ‘uh-huh’ etc) appear unmarked in the transcription, but are tagged in the underlying form.
Speaker 1: er not until I was kind of older
Speaker 2: uh-huh
Non-lexical sounds (coughs, sneezes, laughter, yawns etc.) appear between square brackets. These are not indexed and therefore will not be found using the Standard Search or Advanced Search features. Such items can be located in a document page, using the web browser’s find function.
“Five years till I’d done my apprenticeship [cough]”
“Yeah, I simply don’t [laugh] really remember.”
Non-linguistic “events” which may have an impact on language also appear between square brackets. Audible background “events” which do not affect the language used have not been transcribed.
“[phone rings] Eh? Right you go and do that, Toots. I’m no gonna answer it, we’ll get it later.”
The first additional tag used in the current version of FUSE concerns pronunciation difficulties. Words or longer stretches that the transcriber judges to pose significant pronunciation difficulties for an individual speaker are surrounded by prd tags: [prd]…[/prd]
“But hey, in Finland there is lack of er paper [prd]indu- industry- industry[/prd] and er they have er problems because people are l- doing like you and er the paper [prd]industry[/prd] [laugh] is er the- it’s the Fin- Finland’s first big thing //er before Nokia//”
Finnish and other L1 utterances vs English utterances
Finnish utterances are surrounded by fin tags: [fin]…[/fin]
“//Yeah// because… [fin]No[/fin] yeah well… yeah //[laugh]//”
Other L1s might also be present in the utterances. Transcribers mark these with three-letter language codes. For example, a Swedish utterance might be transcribed as [swe]…[/swe].
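Because the prd tag and the language tags share the same paired form, a single pattern can extract both. The sketch below is a hedged illustration: it assumes every paired tag is exactly three lowercase letters and that tags are not nested, and the function names tagged_spans and non_english_spans are hypothetical.

```python
import re

# Matches [xyz]...[/xyz], where xyz is any three lowercase letters (prd, fin, swe, ...).
TAGGED = re.compile(r"\[([a-z]{3})\](.*?)\[/\1\]")


def tagged_spans(utterance: str) -> list[tuple[str, str]]:
    """Return (tag, content) pairs for every paired three-letter tag."""
    return [(tag, content.strip()) for tag, content in TAGGED.findall(utterance)]


def non_english_spans(utterance: str) -> list[tuple[str, str]]:
    """Keep only language tags; prd marks pronunciation, not language."""
    return [(tag, content) for tag, content in tagged_spans(utterance) if tag != "prd"]


print(tagged_spans("//Yeah// because… [fin]No[/fin] yeah well… yeah //[laugh]//"))
```

Unpaired tags such as [laugh] or [inaudible] are ignored, since they have no closing counterpart.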
Finally, there is the question of how to mark pauses in a transcript. One important aim of this ongoing project is to keep the transcriptions reader-friendly for a wide audience; pauses longer than 400 milliseconds are therefore marked with three consecutive dots (i.e. …).
“M1: Alright erm… Ha- Have you read this new article about erm er Finnish ice hockey league er no Finnish ice hockey team Jokerit going to KHL?”
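The 400-millisecond rule can be sketched as follows. The word timings are an assumption made purely for illustration (nothing above states that FUSE transcriptions are time-aligned), and the name render_with_pauses is hypothetical.

```python
PAUSE_THRESHOLD_MS = 400  # pauses longer than this are marked with "…"


def render_with_pauses(words: list[tuple[str, int, int]]) -> str:
    """Join (text, start_ms, end_ms) words, marking long pauses with '…'."""
    out: list[str] = []
    prev_end = None
    for text, start, end in words:
        # A pause is the silence between the previous word's end and this word's start.
        if prev_end is not None and start - prev_end > PAUSE_THRESHOLD_MS:
            out[-1] += "…"
        out.append(text)
        prev_end = end
    return " ".join(out)


# Hypothetical timings: a 100 ms gap after "Alright", a 600 ms gap after "erm".
timed = [("Alright", 0, 500), ("erm", 600, 900), ("Have", 1500, 1700)]
print(render_with_pauses(timed))
```

A gap of exactly 400 milliseconds is left unmarked here, following the wording that only pauses over 400 milliseconds are marked.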