Distribution of Speech & Silence Durations (3)

Linguistics 521 Exercise #4: Making your .trs files accessible

As noted previously, in doing corpus phonetics you'll often be faced with material in one format that you want to transform into another. For example, you might have the output of a forced-alignment program, and you'd like to create a corresponding Praat .TextGrid file, or a Praat script that lets you inspect and annotate certain cases efficiently. Or you might have some Praat .TextGrid files that you want to turn into a form that R can digest. We'll see many examples of this kind of problem in the course of the course, and many others will come up in any project that you do.

So for this lesson, our goal is to turn the .trs file produced by the Transcriber program into a form that can be assimilated easily by other programs. We'll do this in three steps: I'll give you programs to accomplish the first two steps, and the skeleton of a program to accomplish the third. The minimal assignment is to finish the third program and show that it works. (If you feel ambitious, or are already familiar with the programming concepts involved, you can write your own version of the second program, or all three of them; or do it all in one step...)

We're starting with an xml file that looks something like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-13.dtd">
<Trans scribe="Mark Liberman" audio_filename="Cookie1" version="1" version_date="160131">
<Speakers>
<Speaker id="spk1" name="Subject" check="no" dialect="native" accent="" scope="local"/>
<Speaker id="spk2" name="END" check="no" dialect="native" accent="" scope="local"/>
</Speakers>
<Episode>
<Section type="report" startTime="0" endTime="77.980">
<Turn startTime="0" endTime="7.454">
<Sync time="0"/>

</Turn>
<Turn speaker="spk1" startTime="7.454" endTime="73.086">
<Sync time="7.454"/>
Uh
<Sync time="7.856"/>

<Sync time="8.025"/>
mother is uh washing dishes and- while the kids are
<Sync time="12.447"/>
[...omitted stuff...]
water.
</Turn>
<Turn speaker="spk2" startTime="73.086" endTime="77.980">
<Sync time="73.086"/>

</Turn>
</Section>
</Episode>
</Trans>

And as a first step, we want an output that looks like this:

nobody 0 7.454 SILENCE
Subject 7.454 7.856 Uh
Subject 7.856 8.025 SILENCE
Subject 8.025 12.447 mother is uh washing dishes and- while the kids are
[...omitted stuff...]
Subject 72.599 73.086 water.
END 73.086 77.98 SILENCE

That is, an output where each line consists of four pieces of information about each timed segment

SPEAKER STARTTIME ENDTIME TEXT

and SPEAKER is "Subject" for the line we care about.

And then we'll want to turn it into something like what the SAD program put out:

7.454 7.856 spch
7.856 8.025 nonspch
8.025 12.447 spch
[...omitted stuff...]
72.599 73.086 spch

As I said, we'll do this in three steps. That's partly to create a gentler learning curve, and partly because it is often simpler to do such things in stages anyhow. I'll write the first two steps for you -- though you might want to look at the code to see how it works -- and give you a template for the last step.

untrs1.pl: Since .trs files are xml, which is a way of writing tree-structured data as labelled brackets, we might want to start with a program that knows how to parse xml. We could use XSLT, or one of the perl modules for parsing xml, like XML::Simple, or Python's ElementTree. But this won't help a lot with the .trs files, since the text and timing information is laid out in the tree in an unhelpful way. (If you're interested in such things, you might think about how xml could be used to express the relevant information in an easier-to-process fashion...)

So I've written a simple program untrs1.pl (in /usr/local/bin on harris), which processes .trs files one line (one tag or one text sequence) at a time, and produces something like this:

$ untrs1.pl Cookie1.trs
TURN 0 7.454 nobody
SYNC 0
TEXT SILENCE
TURNEND
TURN 7.454 73.086 Subject
SYNC 7.454
TEXT Uh
SYNC 7.856
TEXT SILENCE
SYNC 8.025
TEXT mother is uh washing dishes and- while the kids are
SYNC 12.447
[...omitted stuff...]
TEXT uh
SYNC 71.993
TEXT SILENCE
SYNC 72.599
TEXT water.
TURNEND
TURN 73.086 77.98 END
SYNC 73.086
TEXT SILENCE
TURNEND
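This is not the actual untrs1.pl (which lives in /usr/local/bin on harris), but here's a sketch of the kind of line-at-a-time processing it might do: match each kind of tag with a regular expression, remember the speaker-id-to-name mapping, and print one simple record per tag or text line. Details like number formatting (e.g. "77.980" vs. "77.98") may differ from the real program's output.

```perl
#!/usr/bin/perl
# Sketch (not the actual untrs1.pl) of line-at-a-time .trs processing.
use strict;
use warnings;

my %name;       # speaker id => display name, from the <Speaker .../> tags
my $inturn = 0; # true between <Turn ...> and </Turn>

# Turn one line of .trs input into one output record, or nothing.
sub process_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /<Speaker id="([^"]*)" name="([^"]*)"/) {
        $name{$1} = $2;    # remember e.g. spk1 => Subject
        return;
    }
    if ($line =~ /<Turn(?: speaker="([^"]*)")? startTime="([^"]*)" endTime="([^"]*)"/) {
        my $who = defined($1) ? ($name{$1} // $1) : "nobody";
        $inturn = 1;
        return "TURN $2 $3 $who";
    }
    if ($line =~ m{</Turn>}) {
        $inturn = 0;
        return "TURNEND";
    }
    return "SYNC $1" if $line =~ /<Sync time="([^"]*)"/;
    if ($inturn && $line !~ /</) {
        # a non-tag line inside a turn: text if non-blank, silence if blank
        return $line =~ /\S/ ? "TEXT $line" : "TEXT SILENCE";
    }
    return;
}

while (my $line = <>) {
    my $rec = process_line($line);
    print "$rec\n" if defined $rec;
}
```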

untrs2.pl: In the next step, untrs2.pl, we need to accumulate four pieces of information for each timed segment:

    1. Speaker
    2. Start time
    3. End time
    4. Text

And we have to deal with a few small problems:

First, SYNC lines can mark both start times and end times of segments, and so we need to keep track of where we are in the cycle of SYNC-TEXT-SYNC lines. Second, the speaker information comes only in the TURN lines, which are not repeated for each TEXT line, so we need to remember who the current speaker is, and use this information for every segment that we output. And third, when we see a TURNEND line, we need to use the end time given in the corresponding earlier TURN line, rather than waiting for the next SYNC line.

So there are four states or stages:

    1. Waiting for TURN information
    2. Within a turn, waiting for start time SYNC
    3. Have start time, waiting for TEXT information
    4. Have start time and text, waiting for end time SYNC or TURNEND
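The four-state cycle above can be sketched in perl like this. This is not the actual untrs2.pl (look at /usr/local/bin/untrs2.pl for that), just one way the state machine might be written, with the same complain-if-unexpected behavior:

```perl
#!/usr/bin/perl
# Sketch (not the actual untrs2.pl) of the four-state segment accumulator.
use strict;
use warnings;

my $state = 1;   # 1: waiting for TURN; 2: waiting for start-time SYNC;
                 # 3: waiting for TEXT; 4: waiting for end-time SYNC or TURNEND
my ($speaker, $turnend, $start, $text);

# Handle one line of untrs1.pl output; return any completed segment.
sub handle {
    my ($tag, @f) = split " ", shift;
    return unless defined $tag;
    if ($tag eq "TURN") {
        die "unexpected TURN in state $state\n" unless $state == 1;
        ($turnend, $speaker) = @f[1, 2];   # TURN lines are: TURN start end speaker
        $state = 2;
    } elsif ($tag eq "SYNC") {
        my @out;
        if ($state == 4) {                 # this SYNC ends the current segment
            push @out, "$speaker $start $f[0] $text";
        } elsif ($state != 2) {
            die "unexpected SYNC in state $state\n";
        }
        $start = $f[0];                    # ... and starts the next one
        $state = 3;
        return @out;
    } elsif ($tag eq "TEXT") {
        die "unexpected TEXT in state $state\n" unless $state == 3;
        $text = join " ", @f;
        $state = 4;
    } elsif ($tag eq "TURNEND") {
        die "unexpected TURNEND in state $state\n" unless $state == 4;
        $state = 1;                        # close the segment with the TURN's end time
        return "$speaker $start $turnend $text";
    }
    return;
}

while (<>) {
    print "$_\n" for handle($_);
}
```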

I've written untrs2.pl to keep track of which state we're in, and to complain if the input it sees doesn't match what it expects. The result looks like this:

$ untrs1.pl Cookie1.trs | untrs2.pl
nobody 0 7.454 SILENCE
Subject 7.454 7.856 Uh
Subject 7.856 8.025 SILENCE
Subject 8.025 12.447 mother is uh washing dishes and- while the kids are
Subject 12.447 12.694 SILENCE
[...omitted stuff...]
Subject 69.394 69.919 uh
Subject 69.919 70.096 SILENCE
Subject 70.096 71.506 she spilled the
Subject 71.506 71.689 SILENCE
Subject 71.689 71.993 uh
Subject 71.993 72.599 SILENCE
Subject 72.599 73.086 water.
END 73.086 77.98 SILENCE

Note that I've used two pseudo-speakers "nobody" and "END" -- that doesn't matter in this case, since we only care about text whose speaker is "Subject". If there were some dialogue with the interviewer, I'd suggest using the speaker ID "Interviewer" for the interviewer's turn(s), and the pseudo-speakers "S2I" and "I2S" for the silences (if any) between the subject's speech and the interviewer's speech.

untrs3.pl: Now the last step is to

  1. Ignore lines whose speaker ID is not "Subject"
  2. Map the text regions that we've labelled "SILENCE" to "nonspch"
  3. Map non-SILENCE text regions to "spch"
  4. For each line, print out the starttime, endtime, and "nonspch" or "spch" classification.

Here's the framework for a perl program that will do the job:

#!/usr/bin/perl
while(<>){
    if(/^Subject/){
        # SPLIT THE LINE INTO FIELDS AT SPACES (into @fields)
        # PRINT THE 2nd AND 3rd FIELDS (i.e. start and end times)
        if($fields[3] eq "SILENCE"){
            # PRINT THE CLASSIFICATION "nonspch"
        } else {
            # PRINT THE CLASSIFICATION "spch"
        }
    }
}

Put the above text in a file ~/bin/untrs3.pl (i.e. in /home/YOURID/bin/untrs3.pl), and make that file executable, e.g. via

$ chmod +x ~/bin/untrs3.pl

Now replace the four comments (the lines beginning with #) with perl code that does what the comments say. (Or better, leave the comments there and add lines of code to carry out the suggested tasks...)

Relevant documentation: split, print -- or look at the uses of split and print in /usr/local/bin/untrs2.pl ...
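For instance, here's a generic illustration of split and print (not the assignment code). The special pattern " " splits on runs of whitespace, and an optional third argument limits the number of fields:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split PATTERN, STRING breaks STRING into a list of fields.
# The special pattern " " splits on runs of whitespace and
# ignores any leading whitespace.
my @fields = split " ", "one two three four";
print "$fields[0] and $fields[3]\n";    # prints "one and four"

# An optional LIMIT argument caps the number of fields, which is
# handy for keeping a multi-word text field in one piece:
my @four = split " ", "Subject 8.025 12.447 mother is uh washing", 4;
print "$four[3]\n";                     # prints "mother is uh washing"
```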

When you're done, you should be able to do

$ untrs1.pl Cookie1.trs | untrs2.pl | untrs3.pl
7.454 7.856 spch
7.856 8.025 nonspch
8.025 12.447 spch
12.447 12.694 nonspch
12.694 16.37 spch
16.37 17.07 nonspch
[...stuff omitted...]
71.689 71.993 spch
71.993 72.599 nonspch
72.599 73.086 spch

And for convenience, you could put a script in your ~/bin directory to do all three at once -- say under the name trs2lab:

#!/bin/sh
untrs1.pl "$1" | untrs2.pl | untrs3.pl

And then e.g.

$ trs2lab Cookie1.trs

will do the whole job.