LING521

Selecting and Inspecting a Sample of Examples

Background

It's often helpful to look at (and listen to) a set of examples of some phenomenon. These might be all the recordings in some small collection, or a designated subset of some larger collection, or a random sample from an even larger one.

Unfortunately, there's still no generally-adopted format for such datasets, and no generally-available interactive method for accomplishing the desired selection and inspection. So for now, you need to program your own way through the problem. Since the relevant datasets are rather diverse in terms of layout and file formats, the necessary scripting is much less trivial than it should be. But the overall approach should carry over from one dataset to another, and the implementation will get easier over time as the field progresses.

(Of course, we really want our friend the AI Phonetician to do all the looking and listening. And sometimes our friend can do that already -- but even when they can, we usually need some human exploration to design their task and check a sample of its results.)

This page aims to lead you through a couple of looking-and-listening examples relevant to our TenseTents exercise. The general approach has three steps:

  1. Find all the examples of the phenomenon of interest in the dataset;
  2. Make a random selection of appropriate size;
  3. Create (and run) a script for inspecting (and perhaps classifying) the individual cases in the selection.
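
As a rough illustration of steps 1 and 2, here is a minimal Python sketch. It assumes that the word-level alignments have already been read into a list of (utterance ID, word, start, end) tuples. The function names find_examples and sample_examples, and the toy data, are hypothetical and are not part of any course script.

```python
import random

def find_examples(words, target):
    """Step 1: return every aligned token whose spelling matches the target word."""
    target = target.lower()
    return [w for w in words if w[1].lower() == target]

def sample_examples(hits, n, seed=0):
    """Step 2: draw a reproducible random sample of at most n hits."""
    rng = random.Random(seed)  # fixed seed, so the same sample can be re-inspected later
    return rng.sample(hits, min(n, len(hits)))

# Toy data, invented for illustration: (utterance ID, word, start sec, end sec)
words = [
    ("utt1", "common", 2.90, 3.28),
    ("utt1", "sense", 3.28, 3.79),
    ("utt2", "moral", 1.95, 2.34),
    ("utt2", "sense", 2.34, 2.86),
    ("utt3", "tents", 0.50, 0.95),
]

hits = find_examples(words, "sense")
selection = sample_examples(hits, n=1)
print(len(hits), "hits; selected:", selection)
```

Fixing the seed is a design choice rather than a requirement: it means that a classification made during step 3 can be traced back to exactly the same selection later on.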

It'll often be the case that the first step is imperfect -- for example, if all that you have is a time-aligned orthographic transcription, finding phonetically and/or morpho-syntactically characterized examples will require some ingenuity and will not have perfect results. And the phone-level description from today's forced-alignment systems generally tries to apply dictionary-derived segment sequences whose phonetic correspondence with the speech stream is variable at best.

But in general this is OK -- even if the result of your search is ore of relatively low quality, due to lots of "false alarms" in step 1, your inspection in step 3 can still extract a pure sample without a great deal of work. (Though you should be careful of possible bias introduced by "miss" errors in step 1...)

Example 1

We'll explore the realization of (presumptive) /nts#/ and /ns#/ sequences in various datasets, and for comparison perhaps /nt#s/, /n#t/, and /n#s/.

The first step is to see how to do this in a simple case, namely the TIMIT dataset -- we'll consider some larger and more complex cases later on.

The TIMIT dataset is on Harris in /plab/timit1. In order to look and listen interactively, you'll need a local copy -- you can get one by downloading (from Harris) and unzipping /plab/L521/timit1.zip. After unzipping, this will take approximately 700 MB of local storage -- if this isn't available on your own machine's internal disk, you could use a thumb drive or similar.

One approach is to create a Praat script that will lead us through the examples one at a time. (This is unfortunately harder than it should be, because of the poor design of Praat's scripting language, the poor design of Praat's TextGrid format, and the fact that Praat forces us to choose between loading short and long audio files.) Later we'll try the same thing with the Emu SDMS.

We'll set up each phase of the TIMIT Praat experiment in 3 steps:

  1. Create a text file in which each line specifies a sentence ID and a target word, chosen to illustrate a particular phonological pattern.
    TIMIT has 2 SA sentences, with 630 recordings each; 450 SX sentences, with 7 recordings each; and 1890 SI sentences, with one recording each.
  2. Run a script that finds all the corresponding file names, with the start and end times of the cited word in each cited file, and puts this in a second text file.
  3. Run a program that turns (2) into a suitable Praat script for checking the selections, if we tell it where to find the audio files and the corresponding textgrids.
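
To make step 2 concrete, here is a minimal Python sketch of the word lookup. It assumes the standard TIMIT distribution format, in which each utterance has a .WRD file whose lines give a start sample, an end sample, and a word, with audio sampled at 16 kHz. The copy on Harris may be laid out differently, and this is not how MakeTimitScript2 itself is implemented. The function name word_spans and the sample offsets below are invented for illustration.

```python
SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz

def word_spans(wrd_text, target):
    """Return (start_sec, end_sec) for each occurrence of target in one .WRD file's text."""
    spans = []
    for line in wrd_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, word = fields
        if word.lower() == target.lower():
            # convert sample offsets to seconds
            spans.append((int(start) / SAMPLE_RATE, int(end) / SAMPLE_RATE))
    return spans

# Invented .WRD-style content (the sample offsets are made up)
example_wrd = """\
2400 8000 that
8000 16000 makes
16000 20800 sense
20800 27200 only
"""

for start, end in word_spans(example_wrd, "sense"):
    print(f"{start:.3f} {end:.3f}")
```

Returning a list rather than a single span allows for sentences in which the target word occurs more than once.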

Let's illustrate this with a simple example. There are six TIMIT sentences containing the word "sense":

SI1128 Others invoked technology and common sense.
SI1208 This doctrine was repugnant to my moral sense.
SI1402 We do not arrive at spatial images by means of the sense of touch by itself.
SI1410 We will achieve a more vivid sense of what it is by realizing what it is not.
SI2248 She always could sense the shag end of a woolly day.
SX372 That diagram makes sense only after much study.

So our input file sense1.in will look like this:

SI1128 sense
SI1208 sense
SI1402 sense
SI1410 sense
SI2248 sense
SX372 sense
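
Since a malformed line in the input file is easy to overlook, it may be worth checking the file before running anything on it. The following Python sketch is one way to do that. It assumes that every line has exactly two whitespace-separated fields, and that sentence IDs consist of SA, SI, or SX followed by digits, as in the examples above. The function name parse_targets is hypothetical.

```python
import re

# Assumed ID shape: sentence type (SA, SI, or SX) followed by a number
ID_PATTERN = re.compile(r"^(SA|SI|SX)\d+$")

def parse_targets(text):
    """Parse 'SENTENCE_ID word' lines into (sentence_id, word) pairs."""
    pairs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2 or not ID_PATTERN.match(fields[0]):
            raise ValueError(f"line {lineno}: expected 'SENTENCE_ID word', got {line!r}")
        pairs.append((fields[0], fields[1].lower()))
    return pairs

example = "SI1128 sense\nSX372 sense\n"
pairs = parse_targets(example)
print(pairs)
```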

Then in a suitable directory of your own on Harris -- e.g. ~/research/TenseTents -- you can run a command like

MakeTimitScript2 '~/data/timit1' sense1 < sense1.in

MakeTimitScript2 is a script in /usr/local/bin (see a copy here) that knows where the TIMIT data is on Harris, and how it's laid out. We've given it two arguments: the path to the TIMIT directory on your local disk (NOT on Harris), and an arbitrary "CASE ID" for naming and keeping track of the various output files.

(Note that if the TIMIT-directory argument uses the tilde abbreviation for your home directory, the argument needs to be put in single quotes to prevent Harris from substituting the path to your home directory on that machine. Alternatively you could give the local pathname in full.)

The program will respond

script in  sense1.praat  -- you can add notes to  sense1.praatnotes

Along the way, it will have created a file sense1.locs with the file names and time spans:

MPAB0_SI1128 3.278 3.787
FKAA0_SI1208 2.337 2.857
FTBR0_SI1402 2.697 3.091
MPGR0_SI1410 1.329 1.691
FGDP0_SI2248 1.213 1.568
MCRC0_SX372 0.903 1.180
MESD0_SX372 1.107 1.590
MLNT0_SX372 1.099 1.454
MMAB0_SX372 1.010 1.327
MMDM2_SX372 0.998 1.338
MPDF0_SX372 1.155 1.569
MRJR0_SX372 1.091 1.520
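
If you want to work with a .locs file outside of Praat, for instance to check word durations or to count tokens per sentence, the format is easy to read. Here is a minimal Python sketch, assuming that each line has the form shown above (a SPEAKER_SENTENCE name followed by start and end times in seconds). The function name parse_locs is hypothetical.

```python
def parse_locs(text):
    """Parse 'SPEAKER_SENTENCE start end' lines into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, start, end = line.split()
        speaker, sentence = name.split("_", 1)
        rows.append({
            "speaker": speaker,
            "sentence": sentence,
            "start": float(start),
            "end": float(end),
            # word duration in seconds, rounded to the file's millisecond precision
            "duration": round(float(end) - float(start), 3),
        })
    return rows

# Two lines copied from the sense1.locs listing above
locs = """\
MPAB0_SI1128 3.278 3.787
MCRC0_SX372 0.903 1.180
"""

for row in parse_locs(locs):
    print(row["speaker"], row["sentence"], row["duration"])
```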

The (excessively complicated Praat-specific) heavy lifting will have been done by another script in /usr/local/bin, seq2script1, which I've linked here in case you want to look at its innards (though you should hope never to have to...).

Now all you need to do is to copy sense1.praat and sense1.praatnotes to your local machine, and proceed as you have before.

To Come...

More of the same in TIMIT -- but then analogous explorations in some MUCH larger datasets, including 1500 hours of audiobooks and 10,000 hours of NPR podcasts...