Ffmpeg Extract Ass Subtitle

Extract subtitles with ffmpeg from a .ts video file

The current default is 'ass_with_timings' for compatibility. This means that all subtitle text decoders currently still output ASS with timings printed as strings in the AVSubtitle.rects[N]->ass fields. Setting 'sub_text_format' to 'ass' allows better timing accuracy (ASS timing is limited to a 1/100 time base, which matters when finer precision is needed).
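To make the 1/100 time base concrete, here is a minimal Python sketch of how a millisecond timestamp collapses into the H:MM:SS.cc form used by ASS. The function name is my own, and truncation (rather than rounding) is an assumption made for illustration.

```python
def ass_timestamp(ms: int) -> str:
    """Format milliseconds as an ASS-style timestamp (H:MM:SS.cc).

    ASS only stores centiseconds, so the last millisecond digit is lost.
    Truncation is assumed here; a real encoder may round instead.
    """
    cs = ms // 10                       # drop the millisecond digit
    hours, cs = divmod(cs, 360_000)     # 360,000 centiseconds per hour
    minutes, cs = divmod(cs, 6_000)     # 6,000 centiseconds per minute
    seconds, cs = divmod(cs, 100)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{cs:02d}"


# Two cues 3 ms apart end up with the same ASS timestamp.
print(ass_timestamp(3_723_456))  # 1:02:03.45
print(ass_timestamp(3_723_459))  # 1:02:03.45
```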

As I understand it, video and audio codecs can be implied from the file extension, but subtitles require an explicit definition. If you add the codec specification for srt, it should work:

ffmpeg -i subs.ass -c:s srt output.srt

To copy the video and audio of an MKV into an MP4 and convert its subtitles to the streaming text format:

ffmpeg -i input.mkv -c copy -c:s mov_text output.mp4

Subtitles are pretty much the same. The subtitle streams are printed in the info output, and you can then extract them with something similar to:

ffmpeg -threads 4 -i VIDEO.mkv -vn -an -codec:s:0.2 srt myLangSubtitle.srt

Here 0.2 is the identifier that you have to read from the info output.

Example to stream copy all of the video and audio streams, convert all text-based subtitle input streams (SRT, ASS, VTT, etc.) to the streaming text format, and set the language for the first two subtitle streams:

ffmpeg -i input.mkv -map 0 -c copy -c:s mov_text -metadata:s:s:0 language=eng -metadata:s:s:1 language=ipk output.mp4
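If you script this, it can help to build the argument list rather than a shell string. The following is only a sketch: the helper name is hypothetical, it merely assembles the arguments of the command above (with the encoder spelled mov_text, as ffmpeg names it), and it does not run ffmpeg itself.

```python
def mov_text_command(src, dst, languages=()):
    """Build an ffmpeg argument list that copies all streams, converts
    text subtitles to mov_text, and tags subtitle streams with languages.

    `languages` is a sequence of language codes, one per subtitle stream,
    in stream order. This helper is illustrative, not part of ffmpeg.
    """
    cmd = ["ffmpeg", "-i", src, "-map", "0", "-c", "copy", "-c:s", "mov_text"]
    for position, lang in enumerate(languages):
        # -metadata:s:s:N addresses the N-th subtitle stream of the output
        cmd += [f"-metadata:s:s:{position}", f"language={lang}"]
    cmd.append(dst)
    return cmd


print(" ".join(mov_text_command("input.mkv", "output.mp4", ["eng", "ipk"])))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where ffmpeg is installed.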

This post chronicles my ultimately failed attempt to extract subtitles with ffmpeg / avconv from a .ts DVB-S video-file recorded from live television.

This is what ffmpeg has to say about the file I wanted to extract subs from

I started with a search for anyone else having tried that and found many pointers on stackoverflow and such, but all were referring to .srt subtitling. It seems a lot of people are transcoding, remuxing and repackaging their anime, but only very few people try to extract subs from broadcast video streams. For example, the ffmpeg docs provide this example.

My ffmpeg can decode and encode all we'd need, right?

Wrong! When you do a naive conversion with ffmpeg, extracting the dvb_sub program-stream from the .ts file and sending it to, let's say, an .srt file, you'll get this dreaded ffmpeg / avconv error: 'Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height'
The explanation behind this is: there are basically two formats in subtitling - image and text based - and most subs discussed on the web are text based, like .srt, .stl, .webvtt; but we here are facing an image-based sub format! (This user here (Is it possible to extract SubRip (SRT) subtitles from an MP4 video with ffmpeg?) had the same learning curve.)
The MPEG legacy has brought us here, I think, because on DVDs and obviously in .ts broadcast streams subs are in image-based formats, VobSub, dvb_sub, dvd_sub.
To see what I'm talking about, refer to vlc's comparison of subtitle formats, as Wikipedia doesn't have one.
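The image/text split can be written down as a small lookup. This is a sketch under assumptions: the codec names are the ones I believe ffprobe reports for these formats, the lists are not exhaustive, and the helper names are made up for illustration.

```python
# Codec names as ffprobe is assumed to report them; not an exhaustive list.
BITMAP_CODECS = {"dvb_subtitle", "dvd_subtitle", "hdmv_pgs_subtitle", "xsub"}
TEXT_CODECS = {"subrip", "ass", "ssa", "webvtt", "mov_text", "text"}


def subtitle_kind(codec_name):
    """Classify a subtitle codec as 'bitmap', 'text', or 'unknown'."""
    if codec_name in BITMAP_CODECS:
        return "bitmap"
    if codec_name in TEXT_CODECS:
        return "text"
    return "unknown"


def can_transcode(src_codec, dst_codec):
    """A conversion is only plausible within one family: bitmap to bitmap
    or text to text. Crossing over from bitmap to text needs OCR."""
    src, dst = subtitle_kind(src_codec), subtitle_kind(dst_codec)
    return src == dst and src != "unknown"


print(can_transcode("dvb_subtitle", "subrip"))        # False
print(can_transcode("dvb_subtitle", "dvd_subtitle"))  # True
```

This matches what happened above: DVB subs to .srt crosses the family boundary, which is where the encoder error comes from.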

For example, try something like this with a similar file:

ffmpeg -i 000.ts -map 0:0 -vcodec copy -acodec copy -map 0:s:1 -scodec dvdsub test.mkv

Just exchange dvdsub with srt, for example, and you'll get the above-mentioned error. What the command does, on the other hand, is transcode the subtitles found in the second subtitle stream (streams are zero-based, so '1' is the second; that is stream 0:6 in this file, while 0:5, the first sub-stream, has teletext). It's beyond me what the actual difference between dvb_subtitles and dvd_subtitles is. But when you watch the result in vlc (mplayer has problems displaying them), the quality of the subs has degraded. (This is only a side note; I think it's because DVB subs are in 16-bit BMP / PGM format, while the transcode wrote DVD subs as 4-bit bitmaps, losing quality in alpha or similar.)
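How a specifier like 0:s:1 maps to an absolute stream index can be sketched in a few lines of Python. The stream layout below is hypothetical; it is only chosen so that the subtitle streams sit at 0:5 and 0:6 as in my file, and the function is a simplified model of the idea, not ffmpeg's parser.

```python
def resolve_specifier(spec, stream_types):
    """Resolve 'file:index' or 'file:type:n' to an absolute stream index.

    stream_types lists the type letter of each stream in file order
    ('v' video, 'a' audio, 's' subtitle). Only these two specifier
    shapes are handled in this sketch.
    """
    parts = spec.split(":")
    if len(parts) == 2:
        return int(parts[1])            # e.g. '0:6' is already absolute
    _file, kind, nth = parts
    matches = [i for i, t in enumerate(stream_types) if t == kind]
    return matches[int(nth)]            # n is zero-based within the type


# Hypothetical layout: one video, four audio, two subtitle streams.
streams = ["v", "a", "a", "a", "a", "s", "s"]
print(resolve_specifier("0:s:0", streams))  # 5
print(resolve_specifier("0:s:1", streams))  # 6
```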


With that learned and at least some working command for ffmpeg, I stumbled over this mailing-list post (Can extract DVB-Sub, cannot extract DVB-T), where someone had issues with dumping subtitles from a .ts file. It gave me this command, which actually worked in dumping the raw subtitles-only stream to a file:

ffmpeg -i file1.ts -vn -an -scodec copy -f rawvideo dvbsub.dat

This post here discusses something similar. My variation of it was

ffmpeg -i 000.ts -map 0:0 -vcodec copy -acodec copy -map 0:s:1 -scodec copy -f rawvideo sub.data

But looking at the file with a hex editor, with me being a hex noob, brought nothing resembling a BMP or PGM file, any headers or structures I'd recognise.

So how can I extract image-based subtitles with ffmpeg? First I checked whether piping a substream to an image format would work:

ffmpeg -i 000.ts -map 0:0 -vn -an -map 0:s:1 -scodec copy -f image2 sub_%03d.bmp

and this actually wrote many images but all were unusable. I don't know how ffmpeg actually chopped the data into files, based on timecode-subtitle-triggers?? I don't know. Finally giving up on ffmpeg, I asked the search engine if any other tools were able to extract image-based subtitles as rendered images/pictures from the video. Some claimed mencoder could do that and I actually found example commands, but none worked for me and all examples centered around DVD and VobSub format type of work, like writing .idx and .sub files from DVD etc.

This post then, although discussing a VOB workflow has pointers into the only feasible way of converting an image-based sub-stream into something text-based or into raw text. There the author used mencoder and a tool called vobsub2pgm and finally sends the resulting character images into an OCR solution. This post does something similar and uses tesseract for OCR. Ffmpeg can't do that, and so far only the other way round, encoding/rendering textual chars as images has been mentioned for ffmpeg in this ticket to add rasterization for sub transcoding.

Just before giving up, as I wasn't inclined to go a painful console-based path of trial and error with multiple tools just to extract some subtitles, I found that AviDemux offers a GUI tool to do just that, OCR'ing of image-based subs! Found in the '> Tools' menu, the older routine is called 'OCR (VOBSub -> srt)' and more recent avidemux builds have 'OCR (TS->srt)'.

Running this on my .ts file didn't work. If you're interested, this page has screenshots of the workflow, which is a bit cumbersome, as OCR is not perfect and you eventually have to edit what's being recognised.

All I got was a weird error 'backdoor) >> 16' and something which brought me to this thread, which once more mentioned a DVB tool called ProjectX. Despite the generic name, it's a very dedicated tool, focusing on inspecting and decoding DVB style .ts files as streamed by European broadcasters. And users on the forums say it's able to extract subtitles from the video mux.

And although it processed my .ts file, and printed all sorts of very involved looking things about packets, streams and elements found in the stream, I was not successful in properly targeting and extracting the 'sub-picture' teletext subtitles stream found in my MPEG transport stream file. And that's the end of it. Post a comment when you have tried something similar and can provide pointers.

For keywords:
How to extract subs with avconv
Dump subtitles with ffmpeg
Extract subs with ffmpeg and write images / bitmaps / ocr
How to write subtitles from a video into a separate srt stl subtitles file

I became a huge subtitle user when I met my wife. We both like to watch a lot of non-English/non-Chinese movies, and while I use English subtitles, she prefers to have subtitles in Chinese most of the time since she can read it faster than English.

Over the years of doing this I've acquired quite a lot of knowledge in this area, and built quite a few tools to help. This post is a way of introducing them to the world, and hopefully it will help anyone in a similar predicament to mine.

Things you will need:

ffmpeg, a set of tools to manipulate multimedia data
srt, a Python library and set of tools I've written for dealing with SRT files (install with pip install srt)

Conversion from other formats to SRT

The SRT format is by far my favourite subtitle format. Its spec has its oddities (not least that there is no widely accepted formal spec), but in general if you stick to the accepted commonalities of the format between media players, you'll find it's not only simple, but easy to modify and script around.

If you have another format, like SSA, for example, you'll probably find that ffmpeg does a pretty good job converting it with ffmpeg -i foo.ssa foo.srt.

Acquiring subtitles

I won't go into too much detail on this, since you probably will have good enough luck Googling '[movie] [language] subtitle', but here are some recommendations:

If you want to extract existing subtitles that are already in your video file (for example, to mux them with other ones), see Extracting subtitles from a video file, below. This is often the best way since these have already been checked to work with the version of the movie you have.
For Chinese subtitles, Shooter is pretty good and frequently updated.
Otherwise, Google '[movie] [language] subtitle'.

Fixing encoding problems

All of the SRT tools take UTF-8 as input, since it's a sane, reasonable encoding across the board. You may find that your subtitles are not encoded as UTF-8 and require conversion.

Let's take Chinese subtitles as an example, as they often use country-preferred encoding schemes. Chinese subtitles usually come encoded as Big5 or GB18030.

I personally find that enca is pretty good at detecting the encoding and converting it appropriately. You can call it as enca -c -x UTF8 -L <language iso code> <sub> to convert subtitles to UTF-8 based on encoding detection heuristics, regardless of their source encoding.
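If you already know the source encoding, the conversion itself needs nothing beyond Python's standard codecs. This is a minimal sketch with a made-up function name; unlike enca it does no detection at all, so the encoding has to be supplied by you.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode subtitle bytes from a known encoding to UTF-8.

    Decoding is strict, so a wrong source_encoding will often raise
    UnicodeDecodeError instead of silently producing garbage (though
    not always: some byte sequences are valid in several encodings).
    """
    return data.decode(source_encoding).encode("utf-8")


big5_bytes = "字幕".encode("big5")      # stand-in for a Big5 subtitle file
print(to_utf8(big5_bytes, "big5").decode("utf-8"))  # 字幕
```

For a real file you would read it in binary mode, pass the bytes through to_utf8, and write the result back out in binary mode.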

Extracting subtitles from a video file

I'll assume you're using a Matroska file, since they're so popular nowadays, but much of this will also apply elsewhere.

Inside an MKV file are multiple streams. They contain things like the video data, the audio data, and subtitles. You can list them with ffprobe:

Looking at the three streams marked 'Subtitle', you can see that we have English, Spanish, and French subtitles available in this MKV.

Say you want to extract the Spanish subtitle to an SRT file. When converting, ffmpeg will pick the first suitable stream that it finds; by default, then, you will get the English subtitle. To avoid this, you can use -map to select the Spanish subtitle for output.
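Picking the stream can also be scripted. The sketch below assumes you have captured the JSON that ffprobe -v error -show_streams -of json FILE prints; the sample data, file names, and helper names are hypothetical and only mimic the layout described here (Spanish at index 4).

```python
import json


def find_subtitle(ffprobe_json, language):
    """Return the absolute index of the first subtitle stream tagged with
    the given language code, or None if there is no such stream."""
    for stream in json.loads(ffprobe_json)["streams"]:
        tags = stream.get("tags", {})
        if stream.get("codec_type") == "subtitle" and tags.get("language") == language:
            return stream["index"]
    return None


def extract_command(src, index, dst):
    """Build the ffmpeg arguments that map one stream to an output file."""
    return ["ffmpeg", "-i", src, "-map", f"0:{index}", dst]


# Hypothetical ffprobe output, trimmed to the fields used above.
SAMPLE = json.dumps({"streams": [
    {"index": 0, "codec_type": "video"},
    {"index": 1, "codec_type": "audio", "tags": {"language": "eng"}},
    {"index": 2, "codec_type": "audio", "tags": {"language": "spa"}},
    {"index": 3, "codec_type": "subtitle", "tags": {"language": "eng"}},
    {"index": 4, "codec_type": "subtitle", "tags": {"language": "spa"}},
    {"index": 5, "codec_type": "subtitle", "tags": {"language": "fre"}},
]})

index = find_subtitle(SAMPLE, "spa")
print(extract_command("movie.mkv", index, "spanish.srt"))
```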


In this case we know that the Spanish subtitle is stream 0:4, so we run this command:

We can see that the right subtitle has been selected:

We will use this subtitle for most of the subsequent examples.

Stripping HTML-like entities from subtitles

As you can see in the subtitle above, sometimes subtitles contain HTML-like entities, like <b>, <color>, etc. These are not part of the SRT spec; it is up to the media player to interpret them. Since not all media players support them, they are sometimes just shown raw, which looks quite bad.
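The core of the stripping step is a single regular expression. This is a standalone sketch with names of my own choosing, not the code used by the srt tools; it removes anything that looks like an opening or closing tag and leaves bare angle brackets in ordinary text alone.

```python
import re

# An optional slash, a letter, then anything up to the closing bracket.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(text):
    """Remove HTML-like tags such as <i>, </b> or <font color="red">."""
    return TAG_RE.sub("", text)


print(strip_tags("<i>sur la voie 'B.'</i>"))  # sur la voie 'B.'
```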

The srt project contains a tool to deal with this called process, which can perform arbitrary operations on files:

Correcting time shifts

Getting subtitles from the internet is an imperfect business. There are often a few different packagings of a movie in different markets, some with different intros, some from different original sources, etc. This can result in the subtitles requiring some correction prior to use.

Your media player may contain some rudimentary controls to correct this at runtime, which may suffice for fixed timeshifts, but for linear timeshifts and cases where you need to sync two subtitles exactly prior to muxing, modifying the SRT file directly is a good idea.

The srt project contains two tools to deal with this:

fixed-timeshift, which shifts all subtitles by a fixed amount. For example, you may want to shift all subtitles back a certain number of seconds to sync properly with your video.
linear-timeshift, which takes two existing time points in the input, and scales all subtitles so that those time points are shifted to the correct values. For example, if you had three subtitles with times 1, 2, and 3, you set the existing times as 1 and 3, and you set the new times as 1 and 5, the new times for those subtitles would be 1, 3, and 5.
On the command line, 'f' means 'from', 't' means 'to', and the numbers are just the unique ID for each pair of times.
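The arithmetic behind both shifts is short enough to write out. This sketch works on plain numbers in seconds and uses function names of my own; it illustrates the idea described above and is not the code of the srt tools.

```python
def fixed_timeshift(times, seconds):
    """Shift every time by the same amount (negative moves earlier)."""
    return [t + seconds for t in times]


def linear_timeshift(times, from1, to1, from2, to2):
    """Map times linearly so that from1 lands on to1 and from2 on to2."""
    scale = (to2 - to1) / (from2 - from1)
    return [to1 + (t - from1) * scale for t in times]


# Existing times 1 and 3 are moved to 1 and 5; the middle one follows.
print(linear_timeshift([1, 2, 3], from1=1, to1=1, from2=3, to2=5))  # [1.0, 3.0, 5.0]
print(fixed_timeshift([1, 2, 3], -0.5))                             # [0.5, 1.5, 2.5]
```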

Muxing subtitles together

The srt project contains a tool, mux, that takes multiple streams of SRTs and muxes them into one. It also attempts to clamp multiple subtitles to use the same start/end times if they are similar (by default, if they are within 600ms of each other), in order to stop subtitles jumping around the screen when displayed.
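To illustrate the clamping idea only, here is one simple way to snap nearby times together. This is my own simplified sketch, not the algorithm mux actually uses: each time is pulled to the first earlier "anchor" time that lies within the tolerance.

```python
def clamp_times(times_ms, tolerance_ms=600):
    """Snap each time to an earlier anchor if it is within tolerance_ms.

    Times that are not close to any existing anchor become anchors
    themselves. All values are in milliseconds.
    """
    anchors = []
    clamped = []
    for t in times_ms:
        for anchor in anchors:
            if abs(t - anchor) <= tolerance_ms:
                clamped.append(anchor)
                break
        else:
            # No anchor was close enough, so this time starts a new group.
            anchors.append(t)
            clamped.append(t)
    return clamped


print(clamp_times([1000, 1400, 5000, 5300]))  # [1000, 1000, 5000, 5000]
```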

Say we wanted to create Spanish/French dual language subs for this movie (having already retrieved a suitable French subtitle in ES-french.srt).

In that case, we'd run something like this:

Removing other languages from dual-language subtitles

This is easier for some languages than others. For example, it's easy to detect and isolate lines containing CJK characters from lines containing (say) English, since their range of characters tends not to intersect.
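Because the character ranges do not overlap, a check of this kind can be written with the standard library alone. The sketch below only looks at two common CJK ideograph blocks, which is an assumption made to keep it short; kana, hangul and the rarer extension blocks are deliberately left out.

```python
# CJK Unified Ideographs Extension A and the main CJK Unified Ideographs block.
CJK_RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF))


def has_cjk(text):
    """Return True if any character falls inside one of CJK_RANGES."""
    return any(low <= ord(ch) <= high for ch in text for low, high in CJK_RANGES)


print(has_cjk("你好"))          # True
print(has_cjk("Hello, world"))  # False
```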

It's more difficult (and more error prone) to try to detect languages using more advanced heuristics, but there are a few ways that you can do it using srt.

srt has a program called lines-matching, to which you can pass an arbitrary Python function that returns True if the line is to be kept, and False otherwise. This means you can easily build your own heuristics for language based detection, or anything else you want to isolate.
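The predicate idea is easy to try out in plain Python before handing a function to the tool. This sketch is a stand-in of my own, not how lines-matching is implemented: it takes the text of one subtitle and keeps only the lines for which the function returns True.

```python
def keep_lines(content, predicate):
    """Keep only the lines of a subtitle's text that satisfy predicate."""
    return "\n".join(line for line in content.splitlines() if predicate(line))


dual = "你好\nHello"
# str.isascii is used as a crude "looks like English" predicate here.
print(keep_lines(dual, str.isascii))  # Hello
```

Any one-argument function works as the predicate, which is the same shape of function that -f expects.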

As an example, this is how you would isolate to Chinese lines using hanzidentifier (must be installed):

You can pass -m multiple times for multiple imports. -f is a function that takes one argument, line. In this case, hanzidentifier.has_chinese already takes one argument, so we don't need to do anything complicated.

As a more general solution, there is also langdetect, but since this is heuristic, you may find it gets it wrong some of the time. For example (langdetect must be installed):

Notice that we have to use double quotes instead of single quotes inside the syntax block, since we're already quoting the expression itself with single quotes.

Using the muxed Spanish and French output we generated earlier as input, this outputs the following:


Notice that one line, <i>sur la voie 'B.'</i>, is completely gone. Language detection is not an absolute science, and sometimes langdetect gets it completely wrong, particularly on short sentences without much context and with language-ambiguous words. For example, in this case, it's very unsure what the language is because the content is quite short. Notice that its certainties vary wildly between runs, sometimes even completely omitting French:


One thing you can do if you want to match per-subtitle rather than per-line (which only makes sense if your different languages are actually in different SRT blocks) is use -s/--per-subtitle, which may help to give better context to langdetect. This fixes the problem above: