Spanish in Texas Project
SpinTX
Authentic Spanish videos for language learning


From Transcript to Tagged Corpus

In this post I will walk through the steps we currently use (as of 01/15/2013) to get from our transcripts to our final corpus.  The process is still messy, but with this documentation anyone should be able to replicate our output on a Mac.

Step 1. Download and unzip this folder where you would like to do your work.

Step 2. Install TreeTagger within ProjectFolder/TreeTagger (look inside the folder you just unzipped).

Step 3. Make sure that you have updated, complete versions of PHP and Python installed.

Step 4. Update TranscriptToSrt.py and SrtGatherer.py with your YouTube client id, secret, and developer key.

Step 5. Save your plain-text transcripts in ProjectFolder/transcripts (one for each video).

Step 6. Update MainInput.txt with your information.

Step 7. Log in to your YouTube account.

Step 8. Open Terminal and navigate to ProjectFolder.

Step 9. Run MainBatchMaker.py by typing: python MainBatchMaker.py

Step 10. Run MainProcessor by typing: ./MainProcessor

And you’re done!  You should now have fully tagged files in ProjectFolder/Processing/Tagged and closed caption files in ProjectFolder/Processing/SRT.  And next time you’ll only need to do steps 5 through 10!
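
Before running steps 9 and 10, it can save time to confirm that the folder layout described in the steps above is in place. The following is only a minimal sketch, not part of the downloaded scripts. It assumes that MainInput.txt sits directly in ProjectFolder next to MainBatchMaker.py and MainProcessor, that transcripts live in ProjectFolder/transcripts, and that they use a .txt extension. The function name preflight is made up for this illustration.

```python
from pathlib import Path

# Folders and files named in the steps above. Their exact locations inside
# ProjectFolder are assumptions; adjust them to match the unzipped folder.
REQUIRED_DIRS = ["TreeTagger", "transcripts"]
REQUIRED_FILES = ["MainInput.txt", "MainBatchMaker.py", "MainProcessor"]


def preflight(project_folder):
    """Return a list of problems; an empty list means the layout looks ready."""
    root = Path(project_folder)
    problems = []
    for name in REQUIRED_DIRS:
        if not (root / name).is_dir():
            problems.append(f"missing folder: {name}")
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    transcripts = root / "transcripts"
    # The .txt extension is an assumption about how transcripts are saved.
    if transcripts.is_dir() and not any(transcripts.glob("*.txt")):
        problems.append("no .txt transcripts found in transcripts/")
    return problems
```

Calling preflight("ProjectFolder") from the directory that contains ProjectFolder and printing the result shows what still needs to be set up.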

A few hints in case you run into trouble:

You may need to install some additional Python libraries as indicated by any relevant errors.
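
Rather than discovering missing libraries one error at a time, you can check for them up front. This is a minimal sketch using the standard library in present-day Python 3. The helper name missing_modules is made up for this illustration, and which modules the scripts actually need is not listed in this post, so any names you pass in are your own guesses taken from the import lines of the scripts.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot find for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, missing_modules(["json", "some_library"]) returns only the names that still need to be installed.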

If you have an encoding error with some of the Spanish characters, you may need to edit srtitem.py.  See my comment on StackOverflow.
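
The actual change to srtitem.py is described in the linked comment and is not reproduced here. The sketch below only illustrates the general kind of error involved, in present-day Python 3: Spanish characters such as ¿ and é cannot be written with an ASCII encoding, while an explicit UTF-8 encoding handles them. The helper names are made up for this illustration.

```python
def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_utf8(path, text):
    """Write text to a file with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_utf8(path):
    """Read a file back using the same explicit UTF-8 encoding."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```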

If the scripts are successful at downloading some srt files from YouTube, but not others, it is probably a timing issue with YouTube’s API.  I am currently trying to build in a work-around, but for now, just wait a few minutes, run MainProcessor again, and cross your fingers.
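
Until a real work-around exists, the wait-and-rerun routine can at least be automated. This is a minimal sketch and not the work-around mentioned above. The function names are made up for this illustration, and missing_srt assumes that a transcript named NAME.txt produces a caption file named NAME.srt, which this post does not confirm.

```python
import time
from pathlib import Path


def missing_srt(transcripts_dir, srt_dir):
    """List transcripts with no matching caption file (assumed NAME.txt -> NAME.srt)."""
    have = {p.stem for p in Path(srt_dir).glob("*.srt")}
    return sorted(
        p.stem for p in Path(transcripts_dir).glob("*.txt") if p.stem not in have
    )


def retry_until(action, is_done, attempts=5, wait_seconds=180, sleep=time.sleep):
    """Run `action` repeatedly, pausing between tries, until `is_done()` is true."""
    for attempt in range(1, attempts + 1):
        action()
        if is_done():
            return True
        if attempt < attempts:
            # Give the remote service a few minutes before trying again.
            sleep(wait_seconds)
    return False
```

Here action could be a function that runs ./MainProcessor through subprocess.run, and is_done could check that missing_srt("transcripts", "Processing/SRT") returns an empty list. The defaults of five attempts and a three-minute wait are arbitrary choices.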

Finally, these scripts are not very efficient yet.  When running them with around 30 videos and around 100,000 words, it takes about two hours on my MacBook Pro.  Sorry about that.  We will be working on optimizing these scripts as time permits.

Please contact me with any questions or suggestions!

Category

Corpus Tools


Creative Commons License SpinTX is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.