Introduction
The FUSE site serves teachers, students and scholars of English who want to search, discover, analyze and comment on how Finnish upper secondary school students use spoken English in the Oral Examination of English in Finnish Upper Secondary Schools. More details about the exam can be found on the website of the Finnish National Board of Education (in Finnish).
Currently the FUSE corpus includes transcribed recordings of Finnish upper secondary school student pairs who have taken part in the Oral Examination of English starting from the academic year 2014. The number of recordings and transcriptions will grow in the future.
FUSE gives you and your target group a chance to focus on aspects of English spoken by Finnish students in the existing database, but its aim also sets it apart from other speech corpora. The keyword is collaboration. FUSE wants to encourage English teachers across Finland to participate in the creation of a unique, nationwide corpus.
Background
The FUSE project has its roots in Scotland, and more specifically in the SCOTS corpus. While studying at the University of Glasgow, I was introduced to a corpus that is special in many respects, including its content, user interface and intended audience. The characteristics of the SCOTS corpus proved that it is possible to create a linguistic resource that is easy to access and use. This is only one of the reasons why I am grateful for the pioneering work the SCOTS project team has done for future generations. For a more detailed description of the technical aspects of the SCOTS corpus, I recommend reading an article written by Dave Beavan titled “Computational challenges, innovations and future of Scottish corpora”.
Content and mark-up
As stated in the introduction, the primary data presented in FUSE consists of recorded pair conversations. The exam recordings have been edited by the administrator of this website (Lasse Ehrnrooth) and are in MP3 format. Each recording covers the third part of the Oral Examination of English, which can be considered the least structured part of the exam. In addition to the recordings, transcriptions form another important part of FUSE. The mark-up guidelines of the transcriptions follow the ones used in the SCOTS corpus but include some additional elements. The general mark-up guidelines from SCOTS are given below in the block quotation, followed by the additional tags and their descriptions.
Overlap
Stretches where more than one participant is speaking at the same time are marked in the transcription by double slashes ( // ) surrounding the words which overlap:
Speaker 1: …although it might come across //as being arrogant.//
Speaker 2: //[laugh]//
Here Speaker 2’s laugh overlaps with the final three words of Speaker 1’s utterance.
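For readers who want to process transcripts programmatically, the overlap convention can be sketched in Python. This is only an illustration, not part of the FUSE or SCOTS tooling: the function name `overlaps` is hypothetical, and the sketch assumes that every overlap is enclosed in one opening and one closing pair of slashes within a single utterance.

```python
import re

# Assumption: an overlap is any stretch enclosed by a pair of double slashes.
OVERLAP = re.compile(r"//(.*?)//")

def overlaps(utterance: str) -> list[str]:
    """Return the overlapping stretches found in one transcribed utterance."""
    return [stretch.strip() for stretch in OVERLAP.findall(utterance)]

speaker1 = "…although it might come across //as being arrogant.//"
speaker2 = "//[laugh]//"
print(overlaps(speaker1))  # ['as being arrogant.']
print(overlaps(speaker2))  # ['[laugh]']
```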
Mark-up
Transcribers have used the following tags:
Censored
Sometimes words or sequences of words have been censored from documents, principally so that individuals may not be identified. Where this has been done, a Censored tag indicates what has been removed as follows:
“Don’t put your fingers in it though, [CENSORED: forename]. Cause you’ll be a mucky pup.”
Other items which may be censored include postal addresses, email addresses, place names, phone numbers, and company names.
Where this applies to an audio transcription, the corresponding section of the audio file has been replaced by a beep (or, in certain circumstances, silence). Censoring of personal or sensitive information has occasionally been necessary in written documents too, and is marked in the same way as above.
Inaudible
Words or longer stretches which the transcriber has not been able to hear or understand appear as follows:
“Yeah, what kind of cup was this [inaudible]?”
Unclear
Parts of the transcription where the transcriber and checker are unsure are surrounded by question marks: [?]…[/?]
“they’re chaffin away [?]crattlin[/?] these toy cups”
Words marked as unclear are not indexed, and do not contribute to the word count.
False starts and truncation
False starts, stammering and truncated words are tagged and appear in the transcription followed by a hyphen:
“nineteen f-f-f- fifty-nine Triumph.”
“Everybody got a Chri- the whole class got a Christmas present.”
These are not indexed, and do not contribute to the word count.
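The two exclusion rules stated so far (unclear stretches and hyphen-final false starts do not contribute to the word count) can be sketched as follows. This is a simplified illustration under stated assumptions, not the indexing code used by SCOTS or FUSE: the function name `indexed_words` is hypothetical, words are split on whitespace only, and any token ending in a hyphen is treated as a false start or truncation.

```python
import re

# Assumption: unclear stretches are always wrapped as [?]...[/?].
UNCLEAR = re.compile(r"\[\?\].*?\[/\?\]")

def indexed_words(text: str) -> list[str]:
    """Return whitespace-separated tokens that would count as words."""
    without_unclear = UNCLEAR.sub(" ", text)
    # Assumption: a token ending in "-" is a false start or truncated word.
    return [tok for tok in without_unclear.split() if not tok.endswith("-")]

print(len(indexed_words("they’re chaffin away [?]crattlin[/?] these toy cups")))  # 6
print(len(indexed_words("nineteen f-f-f- fifty-nine Triumph.")))  # 3
print(len(indexed_words("Everybody got a Chri- the whole class got a Christmas present.")))  # 10
```

Note that a hyphenated word such as “fifty-nine” is kept, because only a trailing hyphen marks truncation in this sketch.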
Semi-lexical items
Semi-lexical items (‘mmhm’, ‘erm’, ‘uh-huh’ etc) appear unmarked in the transcription, but are tagged in the underlying form.
Speaker 1: er not until I was kind of older
Speaker 2: uh-huh
Non-lexical items
Non-lexical sounds (coughs, sneezes, laughter, yawns etc.) appear between square brackets. These are not indexed and therefore will not be found using the Standard Search or Advanced Search features. Such items can be located in a document page, using the web browser’s find function.
“Five years till I’d done my apprenticeship [cough]”
“Yeah, I simply don’t [laugh] really remember.”
Non-linguistic events
Non-linguistic “events” which may have an impact on language also appear between square brackets. Audible background “events” which do not affect the language used have not been transcribed.
“[phone rings] Eh? Right you go and do that, Toots. I’m no gonna answer it, we’ll get it later.”
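Because bracketed sounds and events are not indexed by the search features, one way to locate them in a saved transcript is a small script. The sketch below is an assumption-laden illustration rather than a site feature: the function name `sounds_and_events` and the `NOT_EVENTS` list are hypothetical, and the list simply excludes the other bracketed tags described on this page.

```python
import re
from collections import Counter

# Any text between one pair of square brackets.
BRACKETED = re.compile(r"\[([^\[\]]+)\]")

# Assumed list of bracketed labels that are not sounds or events.
NOT_EVENTS = {"?", "/?", "prd", "/prd", "fin", "/fin", "inaudible"}

def sounds_and_events(text: str) -> Counter:
    """Count bracketed non-lexical sounds and non-linguistic events."""
    labels = (label.strip() for label in BRACKETED.findall(text))
    return Counter(
        label for label in labels
        if label not in NOT_EVENTS and not label.startswith("CENSORED")
    )

print(sounds_and_events("Yeah, I simply don’t [laugh] really remember."))
print(sounds_and_events("[phone rings] Eh? Right you go and do that, Toots."))
```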
Pronunciation difficulties
The first additional tag used in the current version of FUSE marks pronunciation difficulties. Words or longer stretches that, in the transcriber’s analysis, pose significant pronunciation difficulties for an individual speaker are surrounded by prd tags: [prd]…[/prd]
“But hey, in Finland there is lack of er paper [prd]indu- industry- industry[/prd] and er they have er problems because people are l- doing like you and er the paper [prd]industry[/prd] [laugh] is er the- it’s the Fin- Finland’s first big thing //er before Nokia//”
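A teacher or researcher who wants to collect the stretches marked with prd tags could do so along these lines. This is a minimal sketch, assuming the tags are never nested and always closed; the function name `pronunciation_difficulties` is hypothetical.

```python
import re

# Assumption: [prd] tags are properly paired and not nested.
PRD = re.compile(r"\[prd\](.*?)\[/prd\]")

def pronunciation_difficulties(text: str) -> list[str]:
    """Return the stretches marked as pronunciation difficulties."""
    return [stretch.strip() for stretch in PRD.findall(text)]

sample = (
    "there is lack of er paper [prd]indu- industry- industry[/prd] and er "
    "the paper [prd]industry[/prd] [laugh] is er the-"
)
print(pronunciation_difficulties(sample))
# ['indu- industry- industry', 'industry']
```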
Finnish vs English utterances
Some Finnish utterances that might be confused with English are surrounded by fin tags: [fin]…[/fin]
“//Yeah// because… [fin]No[/fin] yeah well… yeah //[laugh]//”
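The fin tags make it possible to leave Finnish words out of an analysis of English usage, so that the Finnish “No” in the example is not counted as the English word. A hedged sketch of that filtering step follows; the function name `english_only` is hypothetical and the code assumes properly paired, non-nested tags.

```python
import re

# Assumption: [fin] tags are properly paired and not nested.
FIN = re.compile(r"\[fin\].*?\[/fin\]")

def english_only(text: str) -> str:
    """Remove Finnish stretches and tidy the leftover whitespace."""
    return " ".join(FIN.sub(" ", text).split())

line = "//Yeah// because… [fin]No[/fin] yeah well… yeah //[laugh]//"
print(english_only(line))
# //Yeah// because… yeah well… yeah //[laugh]//
```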
Pauses
Finally, there is the question of how to mark pauses in a transcript. One important aim of this ongoing project is to keep the transcriptions reader-friendly for a wide audience, and therefore pauses longer than 400 milliseconds are marked with three consecutive dots (i.e. …).
“M1: Alright erm… Ha- Have you read this new article about erm er Finnish ice hockey league er no Finnish ice hockey team Jokerit going to KHL?”
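The 400-millisecond rule can be illustrated with a short sketch. Everything except the threshold is an assumption made for the example: FUSE does not publish word timings, so the input format (word, start and end time in milliseconds), the timing values and the function name `render_with_pauses` are all hypothetical.

```python
PAUSE_THRESHOLD_MS = 400  # from the text: pauses over 400 ms are marked

def render_with_pauses(words: list[tuple[str, int, int]]) -> str:
    """Join timed words, appending '…' to a word followed by a long pause.

    Each item is (word, start_ms, end_ms); this input format is hypothetical.
    """
    pieces = []
    for i, (word, _start, end) in enumerate(words):
        is_last = i == len(words) - 1
        if not is_last and words[i + 1][1] - end > PAUSE_THRESHOLD_MS:
            word += "…"
        pieces.append(word)
    return " ".join(pieces)

# Invented timings: only the silence after "erm" (600 ms) exceeds the threshold.
timed = [("Alright", 0, 450), ("erm", 500, 800), ("Have", 1400, 1600), ("you", 1650, 1800)]
print(render_with_pauses(timed))  # Alright erm… Have you
```

A silence of exactly 400 milliseconds is left unmarked here, following the wording “over 400 milliseconds”.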