Teaching | SpinTX

5 Ways to Open Up Corpora for Language Learning

Corpora developed by linguists to study languages are a promising source of authentic materials to employ in the development of OER for language learning. Recently, COERLL’s SpinTX Corpus-to-Classroom project launched a new open resource that seeks to make it easy to search and adapt materials from a video corpus.

The SpinTX video archive provides a pedagogically-friendly web interface to search hundreds of videos from the Spanish in Texas Corpus. Each of the videos is accompanied by synchronized closed captions and a transcript that has been annotated with thematic, grammatical, functional and metalinguistic information. Educators using the site can also tag videos for features that match their interests, and share favorite videos in playlists.

A collaboration among educators, professional linguists, and technologists, the SpinTX project leverages different aspects of the “openness” movement, including open research, open data, open source software, and open education. It is our hope that by opening up this corpus, and by sharing the strategies and tools we used to develop it, others may be able to replicate and build on our work in other contexts.

So, how do we make a corpus open and beneficial across communities? Here are 5 ways:

1. Create an open and accessible search interface

Minimize barriers to your content. Searching the SpinTX video archive requires no registration, passwords or fees. To maximize accessibility, think about your audience’s context and needs. The SpinTX video archive offers a corpus interface specifically for educators, and plans to create a different interface for researchers.

2. Use open content licenses

Add a Creative Commons license to your corpus materials. The SpinTX video archive uses a CC BY-NC-SA license that requires attribution but allows others to reuse the materials in different contexts.

3. Make your data open and share content

Allow others to easily embed or download your content and data. The SpinTX video archive provides social sharing buttons for each video, as well as providing access to the source data (tagged transcripts) through Google Fusion Tables.

4. Embrace open source development

When possible, use and build upon open source tools. The SpinTX project was developed using a combination of open source software (e.g. TreeTagger, Drupal) and open APIs (e.g. YouTube Captioning API). Custom code developed for the project is openly shared through a GitHub repository.

5. Make project documentation open

Make it easy for others to replicate and build on your work. The SpinTX team is publishing its research protocols, development processes and methodologies, and other project documentation on the SpinTX Corpus-to-Classroom blog.

Openly sharing language corpora may have wide-ranging benefits for diverse communities of researchers, educators, language learners, and the public interest. The SpinTX team is interested in starting a conversation across these communities. Have you ever used a corpus before? What did you use it for? If you have never used a corpus, how do you find and use authentic videos in the classroom? How can we make video corpora more accessible and useful for teachers and learners?

Category

Corpus Applications

Corpus Methods

SpinTX Video Archive (Beta) Has Launched!

Category

Project Updates

Brainstorming on the search & browse interface

We are thinking of offering teachers a practical and user-friendly way of accessing the video clips in the SPinTX corpus. We are assuming that teachers might sometimes be unsure what they can ask of a corpus query interface: they did not design the compilation process, and the corpus may be quite small (compare that with Google, which queries the entire web).

Thus we want to offer teachers two clip retrieval modes: the search mode and the browsing mode. The search mode is the usual Google-like search based on key terms. I would type “banco Medellín” to retrieve documents related to banks (financial institutions) in Medellín (Colombia). However, I would type “banco madera Medellín” if I were looking for documents about carpenters or stores selling wooden benches (to sit on) in Medellín.
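To make the search mode concrete, here is a minimal sketch of a key-term search in which a clip matches only if its transcript contains every query term. The sample transcripts and the function names are invented for illustration; this is not the search engine the project uses.

```python
import re

# Hypothetical transcripts, invented for this example.
transcripts = {
    "clip_a": "Fui al banco en Medellín para abrir una cuenta.",
    "clip_b": "Mi tío hace cada banco de madera en su taller de Medellín.",
    "clip_c": "En Medellín hace buen tiempo todo el año.",
}


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def search(query, transcripts):
    """Return the ids of transcripts that contain every term in the query."""
    terms = tokenize(query)
    return sorted(
        clip_id
        for clip_id, text in transcripts.items()
        if terms <= tokenize(text)
    )
```

With this sample data, searching for “banco Medellín” matches both clip_a and clip_b, and adding “madera” narrows the results to clip_b, which mirrors the disambiguation described above.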

The browsing functionality is intended to facilitate the visual exploration of pedagogically relevant information extracted from the corpus. One initial thought is the use of information clouds, as reflected in the figure below. Imagine a blank square with two drop-down menus. On one of them you could select a topic, to determine the lexical goal, the vocabulary. On the other one you could select the linguistic topic, which could range from grammatical categories to functional ones and a range of other classification criteria that could be relevant for language instruction/learning.

Figure 1 shows what this particular strategy would look like if we select Todos (all topics) in the thematic dropdown list and Gram: Prep. régimen (grammar topic, verb and preposition combinations). The size of each verb+prep combination currently reflects the number of occurrences it has in the corpus, though it could instead reflect the number of documents that contain it.

Figure 1. Wireframe of a user interface for browsing the corpus information on the basis of thematic criteria and linguistic criteria.
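The two sizing options mentioned above (number of occurrences versus number of documents) can be compared with a small sketch. The clip data, the function names, and the linear font-size scaling are all assumptions made for illustration, not the project's implementation.

```python
from collections import Counter

# Hypothetical data: the verb+preposition combinations found in each clip.
clips = {
    "clip_01": ["pensar en", "pensar en", "soñar con"],
    "clip_02": ["pensar en", "depender de"],
    "clip_03": ["depender de", "depender de", "depender de"],
}


def occurrence_counts(clips):
    """Count every occurrence of each combination across the corpus."""
    counts = Counter()
    for combos in clips.values():
        counts.update(combos)
    return counts


def document_counts(clips):
    """Count the number of clips in which each combination appears."""
    counts = Counter()
    for combos in clips.values():
        counts.update(set(combos))
    return counts


def font_sizes(counts, min_size=12, max_size=48):
    """Scale counts linearly onto a font-size range (an assumed scheme)."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        term: min_size + (n - lo) * (max_size - min_size) / span
        for term, n in counts.items()
    }
```

In this sample, “depender de” is the largest item when sizing by occurrences, but it ties with “pensar en” when sizing by documents, because three of its four occurrences come from a single clip.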

Category

Inspiration

Project Updates

From Transcript to Tagged Corpus

In this post I will discuss the steps that we are using to get from our transcripts to our final corpus (as of 01/15/2013). This is still a messy process, but with this documentation anyone should be able to replicate our output (on a Mac).

Step 1. Download and unzip this folder where you would like to do your work.

Step 2. Install TreeTagger within ProjectFolder/TreeTagger (look inside the folder you just unzipped).

Step 3. Make sure that you have updated, complete versions of PHP and Python installed.

Step 4. Update TranscriptToSrt.py and SrtGatherer.py with your YouTube client id, secret, and developer key.

Step 5. Save your plain-text transcripts in Project/transcripts (one for each video).

Step 6. Update MainInput.txt with your information.

Step 7. Log in to your YouTube account.

Step 8. Open Terminal and navigate to ProjectFolder.

Step 9. Run MainBatchMaker.py by typing: python MainBatchMaker.py

Step 10. Run MainProcessor by typing: ./MainProcessor

And you’re done! You should now have fully tagged files in ProjectFolder/Processing/Tagged and closed caption files in ProjectFolder/Processing/SRT. And next time you’ll only need to do steps 5–10!

A few hints in case you run into trouble:

You may need to install some additional Python libraries as indicated by any relevant errors.

If you have an encoding error with some of the Spanish characters, you may need to edit srtitem.py. See my comment on StackOverflow.

If the scripts are successful at downloading some srt files from YouTube, but not others, it is probably a timing issue with YouTube’s API. I am currently trying to build in a work-around, but for now, just wait a few minutes, run MainProcessor again, and cross your fingers.
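To see which videos are still missing captions before running MainProcessor again, a small helper along these lines may be useful. It is only a sketch and not part of the project scripts: it assumes that transcripts are .txt files in a transcripts folder inside the project folder, and that each caption file in Processing/SRT has the same base name with an .srt extension.

```python
from pathlib import Path


def missing_captions(project_folder):
    """List transcript base names that have no matching .srt file yet.

    Assumed layout (not confirmed by the project scripts):
    <project>/transcripts/<name>.txt and <project>/Processing/SRT/<name>.srt
    """
    project = Path(project_folder)
    transcripts = {p.stem for p in (project / "transcripts").glob("*.txt")}
    captions = {p.stem for p in (project / "Processing" / "SRT").glob("*.srt")}
    return sorted(transcripts - captions)
```

Calling missing_captions("ProjectFolder") from the folder that contains ProjectFolder returns the names that still need captions. An empty list suggests that every transcript has a caption file.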

Finally, these scripts are not very efficient yet. When running them with around 30 videos and around 100,000 words, it takes about two hours on my MacBook Pro. Sorry about that. We will be working on optimizing these scripts as time permits.

Please contact me with any questions or suggestions!

Category

Corpus Tools

Automated captioning of Spanish language videos

By the end of the summer, we expect the Spanish in Texas corpus will include 100 videos with a total running time of more than 50 hours. Fortunately, there is a range of services and tools to expedite the process of transcribing and captioning all those hours of video.

YouTube began offering automated captioning for videos a few years ago. Using Google’s voice recognition technology, a transcript is automatically generated for any video in one of the supported languages. As of today those languages include English, Japanese, Korean, Spanish, German, Italian, French, Portuguese, Russian and Dutch. The result of the automated transcription is still very much inferior to human transcription and is not usable for our purposes. However, YouTube also allows the option of uploading your own transcript as the basis for generating the synchronized captions. When a transcript is provided, the syncing process is very effective at creating accurate closed captions synchronized to a video. In addition, YouTube offers a Captioning API, which allows programmers to access the caption syncing service from within other applications.

Automatic Sync Technologies is a commercial provider of human transcription services as well as a technology for automatically syncing transcripts with media to produce closed captions in a variety of formats. Automatic Sync recently expanded their service to include Spanish as well as mixed Spanish/English content. An advantage of using their service is that they have the ability to create custom output formats (requires a one-time fee). For instance, we worked with them to create a custom output file that included the start and end time for each word in the transcript and was formatted as a tab-delimited text file.
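As an illustration of how a word-level timing file like the one described above could be read, here is a minimal sketch. The column order (word, start, end) and the use of seconds are assumptions, and the sample rows are invented; the actual layout of the custom file may differ.

```python
import csv
import io

# Invented sample in the assumed layout: word <TAB> start <TAB> end (seconds).
SAMPLE = "hola\t0.50\t0.82\nbuenos\t0.90\t1.31\ndías\t1.35\t1.80\n"


def read_word_timings(handle):
    """Read tab-delimited rows into (word, start_seconds, end_seconds) tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(word, float(start), float(end)) for word, start, end in reader]


def first_occurrence(timings, word):
    """Return the start time of the first occurrence of a word, or None."""
    for token, start, _end in timings:
        if token.lower() == word.lower():
            return start
    return None


timings = read_word_timings(io.StringIO(SAMPLE))
```

For a real file, the same function can be given an open file handle, for example open(path, encoding="utf-8", newline="").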

There are also online platforms for manually transcribing and captioning videos in a user-friendly web interface. DotSub leverages a crowd-sourcing model for creating subtitles and then translating the subtitles into many different languages. Another option in this category is Universal Subtitles, which is the platform used to subtitle and translate the popular TED Video series. These can be a good option if resources aren’t available to hire transcribers and/or translators.

While developing the SPinTX corpus we have used all of the solutions mentioned above, but we have now settled on a standard process that works best for us. First, we pay a transcription service to transcribe the video files in mixed Spanish / English and provide us with a plain text file, at a cost of approximately $70 per hour of video. Then, we use the YouTube API to sync the transcripts with the videos and retrieve a caption file. This process works for us because our transcripts often need a lot of revisions, and we can sync as many times as we need at no cost. The caption file is then integrated into our annotation process, so when users get search results they can jump directly to the place it occurs in the video. In a later post, we will go into more detail about how we are implementing the free YouTube API and how you can adapt this process for your own video content!
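To show how a caption file can let a search result jump to the right place in a video, here is a minimal sketch that parses SRT-style captions and looks up the start time of the first caption containing a search term. The sample captions and function names are invented for illustration; this is not the project's annotation code.

```python
import re

# Invented sample in the common SRT layout: index, time range, caption text.
SRT_SAMPLE = """1
00:00:01,000 --> 00:00:04,200
Nací en San Antonio

2
00:00:04,200 --> 00:00:08,500
pero mi familia es de Monterrey
"""

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:04,200 to seconds."""
    hours, minutes, seconds, millis = map(int, TIMESTAMP.match(stamp).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (to_seconds(part.strip()) for part in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return cues


def find_start(cues, term):
    """Return the start time of the first cue containing the term, or None."""
    term = term.lower()
    for start, _end, caption in cues:
        if term in caption.lower():
            return start
    return None
```

The returned start time could then be handed to a video player as the position to seek to.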

Category

Corpus Tools

What criteria would you use to search for videos?

[N.B. Background information on the corpus mentioned below: previous post and the SPinTX site, both in English]

A question for those of you who teach Spanish as a Foreign Language (ELE): when you search the Internet for a video to work on a specific grammatical or lexical objective, what kinds of search criteria do you think would be useful to you? We are trying to add metadata to the Spanish in Texas corpus (SPinTX) and have started a short list (included below). Do you have five minutes to give us your opinion? Please leave us a comment!

List of pedagogical descriptors for SPinTX

Morphological level
- Verb tenses: present, past, future, conditional, etc.
- Verb mood: indicative, subjunctive, imperative, infinitive, gerund, etc.
Morphosyntactic level
- Gender of nouns and agreement with determiners.
- Use of prepositions.
  - Por and para: distinguishing uses that express cause, purpose, destination, and recipient.
Discourse level
- Discourse markers.
Lexical level
- Identifying the semantic fields of a text through a list of keywords.
Functional level
- Expressing likes and preferences.

If you have read this far, one more question: can you imagine a fact sheet attached to each video in a list of results, containing this kind of information, so that you could filter for the videos that are more or less suitable for your class?

Category

Corpus Tools

Inspiration

State of the Corpus

One of the questions that is most frequently asked is: How big is your corpus? The answer is: Beats me, it’s constantly changing and there are several different versions of the corpus available at any one time. But people usually aren’t satisfied with that answer, so here are the details of where the SPinTX corpus currently stands to the best of my knowledge (as researched this morning):

Total n interviews: 123

Total n transcripts: 74

Total n words: 315,673

Total n transcripts approved and tagged: 32

Total n words for approved and tagged transcripts: 134,737

Total n clips available to public taken from approved videos: 328

Total n words for clips: 102,573 (Note: many of the clips overlap, this is not filtered out in this count.)

Please let me know if there are any other stats that would be of use/interest and I will append them to this post.

-Cheers, Arthur

Category

Project Updates

Designing a pedagogical interface for a repository of video interviews

One of the goals in the Corpus to Classroom project is to design a pedagogical interface for the repository of video clips that are being generated out of the more than 100 interviews that were collected in the past as part of the Spanish in Texas project. From our interviews with actual teachers and materials developers, we confirmed that teachers are potentially interested in applying the following types of filtering criteria to their searches:

Grammar topics: e.g., search for those clips that contain a significant number of occurrences of por and para
Functional topics: e.g., search for those clips that contain exponents of the function apologizing
Vocabulary: e.g., clips that contain words (in a pre-defined list maybe) that relate to the topic la familia (papá, mamá, padre(s), madre, hermano/a, abuelo/a…)
Thematic: e.g., clips talking about food, traditions, reasons for moving to the US (in our case)…

This is not a complete list, but it is a starting one that contains the most common types of criteria (emotion and phonetics are two criteria that were mentioned too).

With this in mind we are considering the use of a standard search engine (such as Apache Solr/Lucene) to allow teachers to search for the clips and use facets (filtering options) to drill down or define finer-grained queries. However, we are also considering typical corpus query tools (such as CWB, SketchEngine, or NoSketchEngine). Together these would cover the Information Retrieval part of our task (more appropriate for document retrieval on the basis of word- or term-based queries) and the Information Extraction part of our task (more appropriate for queries driven by linguistic patterns).
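To illustrate how the filtering criteria listed above could combine, here is a small in-memory sketch in plain Python. It is not Solr or a corpus query tool; the clip data, the thresholds, the word list, and the function names are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical clips with a thematic label and a transcript.
clips = [
    {"id": "clip_01", "theme": "familia",
     "text": "Mi mamá y mi abuela cocinaban para todos por la tarde."},
    {"id": "clip_02", "theme": "comida",
     "text": "Para mí, los tamales son para la Navidad, por tradición."},
    {"id": "clip_03", "theme": "familia",
     "text": "Mi hermano vive en Houston."},
]

# A pre-defined vocabulary list for the topic la familia (illustrative only).
FAMILY_WORDS = {"papá", "mamá", "padre", "padres", "madre",
                "hermano", "hermana", "abuelo", "abuela"}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def grammar_hits(text, targets=("por", "para")):
    """Count occurrences of the target grammar words in the text."""
    counts = Counter(tokens(text))
    return sum(counts[t] for t in targets)


def vocabulary_hits(text, word_list):
    """Count tokens that belong to the given vocabulary list."""
    return sum(1 for t in tokens(text) if t in word_list)


def filter_clips(clips, theme=None, min_grammar=0, min_vocab=0):
    """Return ids of clips that satisfy every requested filter."""
    return [
        clip["id"]
        for clip in clips
        if (theme is None or clip["theme"] == theme)
        and grammar_hits(clip["text"]) >= min_grammar
        and vocabulary_hits(clip["text"], FAMILY_WORDS) >= min_vocab
    ]
```

Each keyword argument plays the role of one facet: leaving it at its default applies no restriction, and combining several narrows the results, much as a teacher might narrow a search one filter at a time.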

We will further describe our advances in future posts.

Category

Project Updates

LIFT off!

This blog will chronicle the development of the SPinTX Corpus, and our work to bring a pedagogically useful corpus of authentic Spanish and bilingual Spanish-English speech samples into language classrooms across Texas. The Spanish in Texas (SPinTX) Project was selected to receive funding from the Longhorn Innovation Fund for Technology (LIFT) for the grant period September 1, 2012 – August 31, 2013. Development of the Corpus began in 2010 and is ongoing under the auspices of the Title VI Center for Open Educational Resources and Language Learning (COERLL).

The focus of the project over the next year will be to help educators exploit the SPinTX corpus to customize materials for the teaching of Spanish at all educational levels. The aims of the project are:

to develop a pedagogically friendly interface for the corpus;
to involve teachers and learners, via crowd-sourcing, social networking, and workshops, in the development of open educational resources (OER); and
to develop a model for using open source tools and a pedagogical interface that can be adapted for any language corpus.

In the spirit of openness, we will be sharing and discussing what we learn and create throughout the project. We invite you to join with us as we explore new tools and methods for integrating authentic content and open data into the language classroom!

Category

Project Updates
