Transcode movies in the cloud

If you’d just like to know how to do what the title says, skip down to the “What do you need?” and “How to use it?” sections. First, here is the story and some background.

Background
I’ve been an XBMC user for a while now. For me the main strength of XBMC is that you can throw almost any media file at it and it will play it. The scrapers are also great for saving time when getting your library sorted. Where XBMC falls short, Plex comes to the rescue (for Mac users at least). Plex bundles a fork of XBMC that provides a streamlined UI with a set of plugins that actually work. For some reason, however, Plex was far more unstable on my Mac Mini than XBMC ever was, and since my main use was watching movies I eventually switched back to XBMC.

Being a fan of AirPlay, I found the new Apple TV to be the tipping point that got me to buy one. Now I was sitting on a 300GB library in various formats, of which the Apple TV could only play a small number. Having seen the power of ffmpeg before (I believe YouTube still does all its processing with it), I was interested in the results it would deliver across my library. I ran a few samples and was bummed by the performance on my Mac Mini: an average movie transcode could chew up as much as 5 hours of processing. At this pace, getting those movies ready for the Apple TV would take almost 2 months.

EC2 to the rescue
When I started looking into this, Amazon had announced the availability of GPU instances on EC2 just a couple of days earlier. I thought about using those, but so far GPU support in ffmpeg seems to be far from stable.

Still, I went ahead with a test run of a few movies, and to my surprise the same movies took on average only 3 hours to finish on an EC2 small instance. I am not sure why this is, and I did not try to figure out whether I simply had a crappy compile of ffmpeg on my Mac Mini, because by then I had already decided to get my library transcoded in the cloud in under a day.

Bandwidth cost
At the moment the biggest cost in this endeavor is the bandwidth in and out of EC2. For datasets of a few hundred GByte, the AWS Import/Export facility is not much cheaper than uploading them via the Internet (assuming you have a decent pipe and enough bandwidth at your disposal). My mail to Amazon asking why their per-GByte price is far above usual market prices has not drawn a response so far.

Seeing that Level3 (amongst others) is one of their main upstreams, and given the volume they probably purchase, it is safe to assume that they can buy 1 MBit/s for around or under 10 Euro per month. A fully used 1 MBit/s line moves roughly 324 GByte in a 30-day month, so this brings their up-/downstream cost down to about 3 cents per GByte. Taking into consideration that they still have to provide a decent network within their own facilities to push the bytes down to your EC2 instance, one could add another 2 cents per GByte here. This is in fact cutting them a lot of slack, since they only have to push the same bytes within a datacenter or over quite short distances, whereas Level3 could end up pushing them around the globe for a marginally higher price.

Anyway, this totals about 5 cents per GByte, which puts the current bandwidth prices for the EU Ireland region at a whopping 100-200% above the average market price. If Amazon can hold this bandwidth price, they might be the one provider that has found a recipe to keep those premiums in the bit-stream market. :)

Also, I cannot understand why there is a distinction between incoming and outgoing traffic. Technically neither direction costs more than the other. I considered possible peering imbalances on their network that they might be trying to counter with this, but for such things there are far better mechanisms, like acquiring bandwidth-hungry customers and forwarding their packets at the network edge. Boy, if you wanna be paranoid you might even say that they charge more to get the data out of their cloud so they can lock you in. ;)

If their network has a very uneven distribution of traffic, they might want to consider offering off-peak bandwidth rates, similar to what they already do with spot instances. A lot of this data transfer is not bound to specific times, so this could benefit both parties.

To summarize: transferring 300GB of movies into S3 and getting the results out again after transcoding will run up around 75 USD in bandwidth charges. Read on to see why this is a letdown in this example.

Update 30/06/2011:
AWS just announced new pricing. Inbound data transfer is now free and the outbound price has been lowered. This change would bring the 300GB example from above down to 36 USD.
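The two bandwidth totals can be reproduced with a few lines of Python. This is only a rough sketch: the per-GByte rates below are assumptions picked because they reproduce the 75 USD and 36 USD figures, and it further assumes that the transcoded output is about as large as the 300GB input. The function name `transfer_cost` is made up for this example.

```python
def transfer_cost(gb_in, gb_out, usd_per_gb_in, usd_per_gb_out):
    """Return the data transfer charge in USD for one round trip."""
    return gb_in * usd_per_gb_in + gb_out * usd_per_gb_out


LIBRARY_GB = 300  # assumption: output is roughly the same size as the input

# Assumed rates before the update: 0.10 USD/GByte in, 0.15 USD/GByte out.
before_update = transfer_cost(LIBRARY_GB, LIBRARY_GB, 0.10, 0.15)

# Assumed rates after the update: inbound free, 0.12 USD/GByte out.
after_update = transfer_cost(LIBRARY_GB, LIBRARY_GB, 0.00, 0.12)

print(f"before update: {before_update:.2f} USD")  # 75.00 USD
print(f"after update:  {after_update:.2f} USD")   # 36.00 USD
```

If your output files come out much smaller than the originals, the outbound part shrinks accordingly.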

Processing cost
Spot instances. If you haven’t used them on EC2 yet, now is the time to do so. They are basically instances you can bid on to get them at a price well below their normal rate. Amazon probably uses spot instances to ensure they have enough capacity for peak loads without having that capacity sit idle the rest of the time. The spot instance is yours as long as the spot price, which goes up and down, does not exceed your bid. If it exceeds your bid and the capacity is needed, your instance will be shut down without further notice. That sounds bad at first, but for the batch processing intended here it does not matter much, especially considering the savings. For more information on how spot instances work in detail it’s best to consult the AWS documentation directly. Personally I think you shouldn’t set your bid very close to the current price but rather a few cents higher than the peak of the last 24 hours. This way you will most likely keep the spot instance without any interruption and still save a lot.
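The bidding rule of thumb above can be written down as a small helper. This is a minimal sketch, not anything AWS provides: the function name `suggest_bid`, the 0.02 USD margin and the shape of the price history (a list of timestamp/price pairs you collected yourself) are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def suggest_bid(history, now, margin=0.02, window_hours=24):
    """Return the highest price seen in the last window_hours plus a margin.

    history is a list of (datetime, price_in_usd) samples.
    margin is the "few cents" on top of the peak (an assumed default).
    """
    cutoff = now - timedelta(hours=window_hours)
    recent = [price for ts, price in history if cutoff <= ts <= now]
    if not recent:
        raise ValueError("no price samples inside the window")
    return round(max(recent) + margin, 4)


# Example with made-up samples: the 0.090 spike is older than 24 hours
# and is therefore ignored.
now = datetime(2011, 1, 10, 12, 0)
history = [
    (now - timedelta(hours=30), 0.090),
    (now - timedelta(hours=20), 0.038),
    (now - timedelta(hours=3), 0.045),
]
bid = suggest_bid(history, now)
print(bid)
```

A wider window or a larger margin lowers the chance of losing the instance, at the price of a higher worst-case hourly rate.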

For my 300GB example I consumed around 1000 instance hours on EC2 small spot instances. At the time of processing, the spot price was hovering around 0.038 USD per hour. The default limit allows up to 100 spot instances to run simultaneously. For this run I used the full range, starting with 10 instances and progressing to 50 and then 100. After about 12 hours the total cost read 40(!) USD.

This is also why I am not happy with the current bandwidth pricing of EC2. Using the raw CPU power of the infrastructure you can do amazing things at a very affordable price. If bandwidth charges had been a bit closer to the current market, I could have transcoded my entire movie library for 65 USD. As it stands it cost me almost double at 115 USD. I am still happy with the result, however; after all, my poor Mac Mini would otherwise have run at 100% CPU for 2 months, and who knows if the CPU would even have survived that. :)
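For anyone who wants to redo the tally, here it is as a short Python sketch. All numbers are the ones quoted in this post; the only assumption is that the 0.038 spot price is in USD per instance hour.

```python
# Figures quoted in the post.
INSTANCE_HOURS = 1000          # total hours consumed across all spot instances
SPOT_USD_PER_HOUR = 0.038      # assumed unit: USD per instance hour
BILLED_PROCESSING_USD = 40     # what the bill actually read
BANDWIDTH_USD = 75             # transfer in and out of EC2
MARKET_TOTAL_USD = 65          # total quoted for market-rate bandwidth

estimated_processing = INSTANCE_HOURS * SPOT_USD_PER_HOUR
actual_total = BILLED_PROCESSING_USD + BANDWIDTH_USD
implied_market_bandwidth = MARKET_TOTAL_USD - BILLED_PROCESSING_USD
bandwidth_share = BANDWIDTH_USD / actual_total

print(f"estimated processing:     {estimated_processing:.2f} USD")  # 38.00 USD
print(f"actual total:             {actual_total} USD")              # 115 USD
print(f"implied market bandwidth: {implied_market_bandwidth} USD")  # 25 USD
print(f"bandwidth share of bill:  {bandwidth_share:.0%}")           # 65%
```

In other words, roughly two thirds of the bill went to moving bytes rather than to transcoding them.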

What do you need?
I chose to run this with the Ubuntu UEC images provided for EC2. These images ship with a neat extension called cloud-init. In a nutshell, cloud-init offers a structured way to use the userdata facility of EC2 to get packages installed on the system, execute scripts and so on. It’s basically the thing you would use to bootstrap Puppet, CFengine or whatever other configuration management you might be using in your environment. At the same time it allows for a very neat way to inject a batch job with little effort.

In this particular example a set of packages is installed from the Ubuntu repositories, and afterwards ffmpeg (with codecs) and S3FS are compiled. I would use packages wherever possible, but at the time of writing neither S3FS nor an ffmpeg build with the necessary codecs was available from any repository I could find.

Once all of this baseline setup completes, the main script, called transcode.sh, will execute. This script might look more complex than it actually is. It is basically a loop that checks an S3 bucket for more files to process. If it finds one, it locks the file and starts transcoding it with ffmpeg into another bucket. If no more files are available, the instance shuts down. Once everything has been processed, all your instances will have shut down and your results reside in the output bucket. Voila.
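To make the idea concrete, here is a rough Python sketch of such a claim-and-transcode loop. It is not the actual transcode.sh. It assumes the buckets are mounted as plain directories (as S3FS would present them), it keeps lock files in a separate, hypothetical lock directory, and it calls ffmpeg with nothing but an input and an output file, leaving out the codec options a real run would need. The function names are made up for this example, and the final instance shutdown is left out.

```python
import os
import subprocess
from pathlib import Path


def claim(lock_path):
    """Try to create the lock file; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_ffmpeg(src, dst):
    # Bare "ffmpeg -i input output"; codec settings intentionally omitted.
    subprocess.run(["ffmpeg", "-i", str(src), str(dst)], check=True)


def process_bucket(in_dir, out_dir, lock_dir, transcode=run_ffmpeg):
    """Transcode every input file this worker manages to claim."""
    done = []
    for src in sorted(Path(in_dir).iterdir()):
        if not src.is_file():
            continue
        if not claim(Path(lock_dir) / (src.name + ".lock")):
            continue  # another worker already took this file
        transcode(src, Path(out_dir) / (src.stem + ".mp4"))
        done.append(src.name)
    return done  # the caller would shut the instance down after this
```

Exclusive file creation is only atomic on a local filesystem. On a bucket mounted over the network two workers could still grab the same file at the same moment, so treat the lock as best effort.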

How to use it?

  1. Create two buckets on Amazon S3 (input and output)
  2. Upload the movies you want to transcode into the input bucket
  3. Edit s3fs-install.sh with your AWS credentials and bucket names
  4. Edit the URLs in cloud-config.txt, replacing CONFIGURE.YOUR.HOST with a host where you’ll make these files available (make sure they can be retrieved from this URL from EC2)
  5. Upload all the files (ffmpeg-install.sh, x264-install.sh, s3fs-install.sh, transcode.sh, cloud-config.txt) onto the host you configured before
  6. Start your EC2 instances (for instance via the AWS Management Console) with the 32-bit instance AMI for your region and when asked for the userdata provide the content from the cloud-config.txt file
  7. Watch the magic unfold!

What’s next?
Well, this implementation is far from perfect. It was basically a proof of concept, and the locking approach in a distributed environment comes with all of its problems: hanging processes will not be restarted, etc. All in all, the final result needs to be verified by hand. Using something like ZooKeeper in a Hadoop environment might be a much better way forward. After all, there are Java libraries wrapping FFMPEG functionality.

Other options?
I did not have the time to research what other projects/products might be out there. Please do comment if you know any.

So far I have stumbled upon Transloadit while putting this together. The functionality they offer looks really interesting (realtime streaming of the encoded video!). However, their pricing for my 300GB example would come to a total of 540 USD. I understand, though, that their service is not primarily targeted at what I wanted to achieve.

Another option would be Hey!Watch. Their pricing is a bit harder to estimate since they charge based on the length encoded. According to my calculations I believe that my task could have been accomplished with a budget of around 250-300 USD with their service.
