MediaProduction

From MobileDesign


There are several guides available online to help you encode your video content for mobile delivery, but encoding is only the last step of the media production process. This guide gives media content producers, directors, and photographers the information they need to develop high-quality multimedia content for delivery to wireless handheld devices.



Background Issues Affecting Mobile Development

Many media players support a variety of MPEG-4 file formats (including AAC and QCELP audio) along with RAM audio. Unlike most earlier video formats, MPEG-4 was designed with mobile users in mind. It therefore offers very good compression and, in addition to the traditional "frame model" of video encoding, introduces an "object model".

In the "frame model", video is encoded, transferred, and displayed one frame at a time. Opportunities for compression lie in detecting which pixels do not change from frame to frame, along with normal image compression within each frame.

In the "object model", video is defined as a composition of several objects, which can include a "background", a "sprite", and a "face", or any other combination imaginable. In fact, MPEG-4's scene description format (BIFS) is based on VRML, the Virtual Reality Modeling Language. Opportunities for compression lie in downloading the background just once (even when the foreground moves in front of it), detecting patterns such as tiles and storing images as collections of repeating tiles, understanding how sprites (bodies) move, refreshing different parts of the screen at different rates, and so on.

Video using the frame model is "not scalable": in poor network conditions it is subject to dropped frames, which can cause the audio and video to fall out of sync. Video using the object model is "scalable": the player can detect poor network conditions and reduce refresh rates for less important parts of the image.

Note that MPEG-4 supports both models, although most 2004-era players support only the "frame model" (Simple Visual Profile), which is essentially equivalent to MPEG-2. Future devices should let you reap the benefits of the "object model".

Frame Rate & Refresh

Frame rates for mobile devices pose an interesting problem for producers. Frame rates in commercial television have been constant for well over half a century, and professionals have a long-practiced, intuitive understanding of what kinds of action shots work well and what kinds require special handling. Mobile frame rates are low and variable, but encoders use techniques such as selective refresh so that perceived quality is better than the raw frame rate would suggest. Further, mobile devices often lose connectivity momentarily, and much goes on behind the scenes to minimize the effect of such losses on the perceived quality of service. The net effect is that small changes that stay in roughly the same position in the frame are handled far more gracefully than whole-screen changes, such as those caused by rapidly panning the camera.

On Streaming Bandwidth and Image Quality

Modern data networks are designed to give each user as much bandwidth as possible without denying service to any user. Thus, depending on network conditions, users could see speeds as low as 9.6 kbps or as high as 300 kbps. Speeds can change from moment to moment, as the network takes advantage of disconnects and pauses in conversations to reallocate resources.

Unfortunately, despite the promise of MPEG-4 adaptive bandwidth technologies, most current implementations do not let the content provider assess the bandwidth currently available to the user. The server is therefore unable to take advantage of any technology that varies image quality as a function of bandwidth.

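For simple streaming video plus audio using the "frame model", image quality (the data available to describe each frame) follows a simple formula:

$$\text{ImageQuality} = \frac{\text{Bandwidth} - \text{AudioRate}}{\text{FrameRate}}$$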

In this equation, you control AudioRate (12 kbps is generally good) and FrameRate (5 fps is usually good), but Bandwidth changes beyond your control. Image quality can be measured in kilobits per frame, but the measure that matters is user perception. If users already perceive your images as excellent when Bandwidth is only 38 kbps, then an increase in Bandwidth to 77 kbps will not improve perceived image quality.
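As a minimal sketch, assuming the simple formula is the bandwidth left over after audio divided by the frame rate (kilobits per frame), the trade-off can be tabulated as follows. The function name `kb_per_frame` is hypothetical.

```python
def kb_per_frame(bandwidth_kbps: float, audio_kbps: float, fps: float) -> float:
    """Kilobits available to describe each video frame.

    Assumes ImageQuality = (Bandwidth - AudioRate) / FrameRate.
    """
    if fps <= 0:
        raise ValueError("frame rate must be positive")
    video_kbps = max(bandwidth_kbps - audio_kbps, 0.0)  # audio is paid for first
    return video_kbps / fps


if __name__ == "__main__":
    # Bandwidth varies beyond your control; audio (12 kbps) and frame rate (5 fps) do not.
    for bandwidth in (19, 38, 55, 77):
        print(f"{bandwidth:>3} kbps -> {kb_per_frame(bandwidth, 12, 5):.1f} kb per frame")
```

At 38 kbps this leaves 5.2 kb per frame and at 77 kbps 13 kb, though, as noted, more kilobits per frame do not necessarily mean better perceived quality.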


Research has examined the subjective perception of mobile multimedia. Across a wide variety of content scenarios:

   * The minimum acceptable audio rate was 12 kbps regardless of frame rate
   * Good audio enhances the perceived quality of video
   * 5-10 frames per second (whole-screen refresh), with at least 31 kbps, are acceptable except for content with high amounts of action

To make high-action clips work well, the data rate needs to exceed 60 kbps and the frame rate 10 frames per second. Note that many add-on viewers available as of 2004 can deliver only 2 frames per second for streaming content. See Determining Minimum Object Resolution, below, for equations relating the lens, subject distance, and display resolution to the smallest object visible on screen.
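The rules of thumb above can be turned into a quick screening check for a planned stream. This is a hypothetical helper that assumes the cut-offs exactly as summarized (12 kbps audio minimum; at least 31 kbps and 5 fps for ordinary content; more than 60 kbps and at least 10 fps for high-action content), and treats the bandwidth figure as the total stream rate. The underlying studies report ranges, not hard thresholds.

```python
def meets_rules_of_thumb(bandwidth_kbps: float, fps: float,
                         audio_kbps: float, high_action: bool) -> bool:
    """Screen a planned stream against the rules of thumb above (assumed hard cut-offs)."""
    if audio_kbps < 12:
        # Minimum acceptable audio rate, regardless of frame rate.
        return False
    if high_action:
        # High-action clips: more than 60 kbps and at least 10 fps.
        return bandwidth_kbps > 60 and fps >= 10
    # Ordinary content: at least 31 kbps and 5 fps.
    return bandwidth_kbps >= 31 and fps >= 5


if __name__ == "__main__":
    print(meets_rules_of_thumb(38, 5, 12, high_action=False))  # True
    print(meets_rules_of_thumb(38, 8, 12, high_action=True))   # False
```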

Production and Preprocessing

The better the quality you obtain from your production and preprocessing steps, the better the final product will be, even after it is compressed and reduced for narrow bandwidth. We recognize that this section is both a review for long-time professionals in the field and insufficient for newcomers, but we hope that between this document and other available information, newcomers can get started without too many detours.

   * Avoid consumer-grade recording formats such as VHS and S-VHS. Instead, use a professional-grade format such as BetaSP, DVCAM, or DigiBeta.
   * Record in a digital format. This avoids an analog-to-digital conversion, which can degrade quality.
   * Capture at full frame size and frame rate. You’ll need to reduce both later, but the reduction process can sharpen image edges.

Video

   * Never capture video to the drive that holds your swap file. This is usually the drive where your operating system is installed, unless you have moved the swap file elsewhere. During capture, data arrives at a rate not much slower than the drive’s write speed. If the operating system needs to use the swap file as "extra RAM", or just needs to check it (as operating systems are wont to do), the time spent moving the disk head to the swap area and back can cause the input buffer to overflow, and you will irretrievably lose some of your data. The more RAM your system has, the less likely the buffer is to overflow.

As noted above, the user's perception of video quality is affected by the quality of the audio: more compelling audio makes for a better perception of the video. Further, the device may well be used while the viewer is actually on the move, so the viewer's visual attention will almost certainly be divided across many tasks. Based on both user perception and user tasks:

   * Use the images to support the audio, not the other way around. Think radio broadcasting with supporting imagery.

You can’t be sure of the frame rate, which may vary between 2 and 20 frames per second. Any action that requires large portions of the screen to be refreshed is likely to cause pixelation and smearing. Techniques for handling this include:

   * Cut between shots instead of panning to follow action, where possible. Cutting between the start and end points of an action sequence may lose the action in between, but it results in only a single full-screen refresh, instead of making the network attempt to push 8 or 10 full-screen refreshes for about a second’s worth of action.
   * If you pan, keep the video subject (vehicle, player, ball, etc.) tightly cropped and centered. The background will likely smear and pixelate, but, used sparingly, as in a replay, this can work.
   * In sporting events, use replays and slow motion or stop motion to provide a view with compelling levels of detail. Again, keep the video tightly cropped.
   * Minimize the effects of rapid motion. Where possible, use shots that move towards or away from the camera.

The most obvious issue with mobile multimedia is that the screens are small, which means fewer pixels per screen. The appendix discusses the mathematics and practices for determining a physical object's resolution on the screen. In general, don't expect much detail.

   * Avoid wide shots if any fine detail is important to the shot. While wide shots are often used as establishment shots, and can be effectively used for that purpose with mobile devices, keep them short. Per the example resolution calculations in the appendix, a scoreboard likely will be unreadable on a shot wide enough to show an entire ball field.
   * Shoot more closeups.
   * Avoid watermarks; they won’t scale down. On-screen text such as crawls, box scores, etc. will have to take up a much larger percentage of the screen to be effective.

As can be seen from the above recommendations, better results will be obtained by using a specialized production unit, rather than merely attempting to reformat broadcasts for mobile distribution.

The technology will get better, but these recommendations will largely still apply. Faster, more reliable networks, better compression, and smarter pixel-scaling algorithms are all in the works. More pixels on the display, more memory, and more processor power will certainly become common. However, one thing won’t change: for a device to be mobile, it has to be small. So, unless video glasses take off, resolution may eventually be limited not by the number of pixels but by the absolute size of the displayed object on the mobile screen.

Given all this, don’t trust conversion tools to take your broadcast content and just "do it". Look at the storyboard for your broadcast and identify the shots that won’t work on mobile and the shots that will, then draw up a mobile storyboard. Comparing the two storyboards will tell you what extra resources a combined broadcast/mobile production needs: it may be possible to do without any extra cameras, or it might require one or more cameras dedicated to capturing mobile-optimized shots.

Audio

If images support the audio, then the audio is the primary medium. Make sure that your audience would be satisfied with the production even if it were audio only. Consumer testing has shown that low resolution, low frame rate images are perceived as being much better than they actually are when accompanied by better quality audio.

   * Deliver audio in mono, not stereo. Stereo obviously takes up more bandwidth than mono, and only a very few devices support stereo playback in any way.

The ambient noise floor around the user is highly variable, even more so than in an automobile.

   * Avoid content with a large dynamic range. Unfortunately, traditional analog audio processing techniques for compressing dynamic range may make digitization of the signal less efficient. We have not found a formula that predicts how much analog processing can be applied before the digitization suffers; just be aware that it can be a problem.
   * If the content is mainly speech, use QCELP encoding and ensure that the analog signal is well modulated. You should then be able to avoid the audio processing problems mentioned above entirely.

The audio content in a broadcast may not translate well to mobile. Just as you would identify places in the storyboard where you would use alternate camera work for the mobile version, you should do the same for audio content. If, for example, the storyboard calls for "letting the picture speak for itself," consider adding audio content to support the image.


Post-Production

Wireless networks deliver speeds that vary from moment to moment, depending on network conditions. Thus your user may start viewing your clip at 77 kbps, drop to 19 kbps, and finish at 55 kbps. Unfortunately, despite the MPEG-4 standard, most networks do not yet allow you to detect network conditions and respond accordingly.

Compression of motion pictures reduces the amount of information needed to describe a series of frames by sending full information only once every several frames, and using prediction to describe the other frames in terms of their changes from that anchor frame. The fully described frames are called key frames (I frames). Predicted frames (P frames) follow a key frame and contain only the changes relative to the anchor frame.
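To make the key-frame/predicted-frame idea concrete, here is a toy sketch that treats a frame as a flat list of pixel values, sends the first frame in full, and for each later frame sends only the pixels that differ from the frame before it. It illustrates the principle only, under that simplifying assumption; real MPEG-4 encoders use motion compensation and transform coding rather than raw pixel differences. The names `encode` and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[int]       # toy frame: a flat list of pixel values
Delta = Dict[int, int]  # pixel index -> new value


def encode(frames: List[Frame]) -> Tuple[Frame, List[Delta]]:
    """Send the first frame in full (the key frame) and, for each later
    frame, only the pixels that differ from the frame before it."""
    key = list(frames[0])
    deltas: List[Delta] = []
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("all frames must have the same size")
        deltas.append({i: v for i, (p, v) in enumerate(zip(prev, cur)) if p != v})
    return key, deltas


def decode(key: Frame, deltas: List[Delta]) -> List[Frame]:
    """Rebuild every frame by applying each delta to the previous frame."""
    frames = [list(key)]
    for delta in deltas:
        cur = list(frames[-1])
        for i, v in delta.items():
            cur[i] = v
        frames.append(cur)
    return frames


if __name__ == "__main__":
    clip = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 9, 9, 0], [1, 1, 1, 1]]
    key, deltas = encode(clip)
    print([len(d) for d in deltas])  # pixels sent per predicted frame: [1, 1, 4]
    assert decode(key, deltas) == clip
```

The last frame, a whole-screen change like a fast pan, costs as much as sending the entire frame again, which is why the earlier advice favors cuts over pans.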

   * When given the opportunity, choose "lossy" encoding. "Lossless" encoding will give slightly better quality, but between lost frames, low bandwidth, and limited device capabilities (both audio and video), users are unlikely to notice. In exchange, you get better compression, which improves the user experience. You may find that lossy encoding allows you to encode at a higher frequency.
   * Only provide URIs to content on your own servers. Do not use MPEG-4 capabilities to advertise or launch a browser.
   * Do not embed a URI to inform you that the user has finished viewing your content. There is no need to add extra network traffic or violate users’ privacy.
   * Convert from analog to digital only once. If you must capture in analog, ensure that you do not transfer to disk, then back to analog, then back to digital again.
   * Use professional-grade capture and encoding hardware. SDI or FireWire connections are the best choices.

The type of content you are capturing affects users' perceived quality. For example, a "talking head" can deliver a good user experience on relatively slow networks with lower frame rates, whereas a sports clip requires higher visual quality. There are several things you can do to reduce the need for high-fidelity shots, as outlined above. However, any audio content needs to be at least 12 kbps to support the video; lower-quality audio actually degrades the perceived quality of the video.

Most networks deliver video content as individual, full-screen refreshes, which makes achieving the minimum of 5 frames per second for most content (10-15 for high-action content) more difficult. If the media delivery system uses the MPEG-4 capability for partial-screen refreshes, you will get higher perceived frame rates.

See Determining Minimum Object Resolution, below, for equations relating the lens, subject distance, and display resolution to the smallest object visible on screen.

   * Consider supplying "low bandwidth" and "high bandwidth" versions of challenging content. If the user tries the high bandwidth version and finds that the network does not currently produce a good experience, the low bandwidth version is available.

Audio

Different players support different audio formats. AAC is common, as is QCELP. AMR and QCELP are targeted at voice and are very efficient (and predictive). AAC is targeted at higher-fidelity applications such as music. MPEG-4 AAC can provide near-transparent quality at 64 kbps and is designed to remain "almost as good" at bit rates down to 16 kbps.

AAC Low Complexity Profile is within the MPEG-4 Scalable Audio Profile. There is some scalability built in (bit rate scalability, encoder scalability), but relying on these may generate unpredictable results. As some developers have noted, "It’s amazing when it works, but it doesn’t always work."

   * Manually compress your AAC audio as much as possible; if you rely on the compression allowed in the Low Complexity Profile, you may get unpredictable results. Use techniques such as limiting, sample rate conversion, stereo-to-mono conversion, low- and high-pass filtering, and noise reduction to reduce the required bandwidth. Realize that on 2.5G networks your users will average a connection of roughly 38 kbps, and about half the time it will be lower.
   * Use QCELP rather than AMR, AAC, or other encoding options for voice applications. You’ll save your users some bandwidth.
   * When using QCELP, use full-rate encoding. Full-rate (13 kbps) QCELP gives a significantly better experience than half-rate (8 kbps), and users will also perceive your video as being of higher quality.
   * Use 12 kbps for streaming audio. Across all multimedia types, audio streams of 12 kbps were associated with the highest quality multimedia experiences.
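Two of the bandwidth-reducing steps listed above, stereo-to-mono conversion and limiting, are simple enough to sketch on raw sample values. This is a minimal illustration that assumes floating-point samples in the range -1.0 to 1.0. The function names and the default ceiling are hypothetical, and the "limiter" here is a crude hard clip; a production limiter applies smoothed gain reduction instead. Sample rate conversion and filtering are better left to a proper audio tool.

```python
from typing import List, Sequence


def stereo_to_mono(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Downmix two channels to one by averaging them sample by sample."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    return [(lv + rv) / 2.0 for lv, rv in zip(left, right)]


def hard_limit(samples: Sequence[float], ceiling: float = 0.5) -> List[float]:
    """Clip every sample to [-ceiling, ceiling], crudely reducing peak level."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]


if __name__ == "__main__":
    mono = stereo_to_mono([0.8, -0.2, 0.0], [0.6, -0.4, 1.0])
    print(hard_limit(mono))
```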

Video

   * Use 8 frames per second for video. Some devices will not be able to support more, and users find this rate acceptable.
   * The ratio of predicted to key frames must not exceed 10 to 1.
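Combining the two rules above, a key frame is needed at least every 11 frames (one I frame followed by at most 10 P frames), which at 8 frames per second means every 1.375 seconds. Here is a minimal sketch of such a frame-type schedule with hypothetical function names. Real encoders expose this as a key frame interval or GOP length setting, with names that vary by tool.

```python
def frame_types(n_frames: int, max_p_per_key: int = 10) -> str:
    """Frame-type pattern: a key frame (I) followed by at most
    max_p_per_key predicted frames (P), repeated."""
    group = max_p_per_key + 1
    return "".join("I" if i % group == 0 else "P" for i in range(n_frames))


def key_frame_interval_s(fps: float, max_p_per_key: int = 10) -> float:
    """Seconds between key frames when every group is as long as allowed."""
    return (max_p_per_key + 1) / fps


if __name__ == "__main__":
    print(frame_types(24))          # 3 seconds at 8 fps
    print(key_frame_interval_s(8))  # 1.375
```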


Determining Minimum Object Resolution

What is the smallest object that can be seen in a given image? This is a key question for the professional content creator in the mobile market, as the content creator will likely bring a wealth of experience and equipment aimed at generating quality content for much higher resolution display devices (even standard television). The information below represents some easy approximations for estimating whether or not a given camera shot has a chance of being usable on a mobile device.

Note that this is not a problem for amateurs authoring content using such a mobile device, as what the amateur sees on the device reflects what the viewer will see.

The limiting factor in digital image resolution is the number of pixels available for storing image data, rather than such traditional film photography factors as film grain size, chromatic aberration, etc. With that assumption, we can calculate the smallest object resolvable on a target device from the following:

   * The angle of view of the lens creating the image.
   * The distance from the lens to the subject.
   * The pixel dimensions of the target display.
   * Optionally, the screen size of the target display.

Angle of view of a lens / sensor combination may be calculated using the following formula:

$$AoV = 2\arctan\left(\frac{FD}{2\,FL\,(1 + M)}\right)$$

where M is the magnification.


Because magnification is negligible except in macrophotography (or microphotography), that factor can be dropped, leaving:

$$AoV = 2\arctan\left(\frac{FD}{2\,FL}\right)$$

where:

   * AoV: Angle of View
   * FD: Film Dimension (measured along the axis of interest, e.g. the diagonal)
   * FL: Focal Length


The diagonal of a 35 mm film frame (36 × 24 mm) is 43.27 mm. Given a 50 mm lens, AoV = 2 × arctan(43.27 / (2 × 50)). So our "normal" 50 mm lens on that trusty old Nikon SLR yields a 46.8° angle of view.

Put that same lens on a Nikon D100 digital SLR, and the "film" becomes a sensor with a 28.31 mm diagonal, making the AoV 31.61°: a bit more "telephoto", roughly equivalent to an 80 mm lens on 35 mm film (closer to 75 mm, but those lenses are rare).
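The angle-of-view calculations above are easy to reproduce. Here is a minimal sketch (function names are hypothetical) using the 35 mm frame diagonal and the 28.31 mm D100 sensor diagonal quoted above. The second function computes the 35 mm-equivalent focal length by scaling with the ratio of the diagonals, the usual "crop factor" approximation; it keeps FD/FL, and hence the AoV, unchanged.

```python
import math


def angle_of_view_deg(film_dim_mm: float, focal_length_mm: float,
                      magnification: float = 0.0) -> float:
    """AoV = 2 * atan(FD / (2 * FL * (1 + M))); M is ~0 outside macro work."""
    return math.degrees(
        2 * math.atan(film_dim_mm / (2 * focal_length_mm * (1 + magnification))))


def equivalent_35mm_focal_length(focal_length_mm: float, sensor_diag_mm: float) -> float:
    """Focal length giving the same diagonal AoV on 35 mm film."""
    film_diag_mm = math.hypot(36, 24)  # about 43.27 mm
    return focal_length_mm * film_diag_mm / sensor_diag_mm


if __name__ == "__main__":
    film_diag = math.hypot(36, 24)
    print(f"35 mm film: {angle_of_view_deg(film_diag, 50):.1f} deg")            # 46.8
    print(f"D100:       {angle_of_view_deg(28.31, 50):.2f} deg")                # 31.61
    print(f"D100 equivalent: {equivalent_35mm_focal_length(50, 28.31):.0f} mm")  # 76
```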

Field of View at a given distance can be estimated by taking:

$$FoV = 2\,D\,\tan\left(\frac{AoV}{2}\right)$$

where D is the distance from the lens to the subject.

So the minimum resolvable object at a given distance D is the field of view divided by the pixel width of the final display device. Hence:

$$MinObject = \frac{FoV}{PixelWidth} = \frac{2\,D\,\tan(AoV/2)}{PixelWidth}$$

For example, with a roughly "normal" lens of 50° AoV, at 500 feet the total field of view is 466.3 feet. If the final display device is 160 pixels across, the narrowest object we can see is 2.91 feet across. The outfielder against the back wall of a baseball stadium has just ceased to exist.
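The field-of-view and minimum-object formulas can be checked against the stadium example the same way. Here is a minimal sketch with hypothetical function names. Distance units carry straight through. Strictly, the AoV should be measured along the same axis as the display's pixel count (here, horizontally), which the rough 50° figure glosses over.

```python
import math


def field_of_view(distance: float, aov_deg: float) -> float:
    """Width covered by the lens at the given distance: 2 * D * tan(AoV / 2)."""
    return 2 * distance * math.tan(math.radians(aov_deg) / 2)


def min_resolvable_object(distance: float, aov_deg: float, display_pixels: int) -> float:
    """Smallest object that still spans one display pixel at that distance."""
    return field_of_view(distance, aov_deg) / display_pixels


if __name__ == "__main__":
    fov = field_of_view(500, 50)               # feet in, feet out
    obj = min_resolvable_object(500, 50, 160)
    print(f"field of view {fov:.1f} ft, smallest object {obj:.2f} ft")
    # field of view 466.3 ft, smallest object 2.91 ft
```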

For our final trick, here is a physical method of calculating the AoV of a given lens (or given FL setting on a zoom):

   * Take a yardstick (or anything else of known size) and set it up so that its length is perpendicular to the axis of the lens in question. Move the camera until the yardstick exactly fills the viewfinder, then measure the distance from the yardstick to the camera.
   * If the target is small, like a yardstick, and the lens is wide-angle, the measured distance will be short, so make your best estimate of where the sensor sits in the camera: an error of a few inches here will make a real difference when extrapolated to a few hundred feet. Getting within an inch should be good enough. You now have the distance to your target and the size of the target, which yield the Angle of View through the formula:

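$$AoV = 2\arctan\left(\frac{S}{2\,D}\right)$$

where S is the size of the target and D is its distance from the sensor, in the same units.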
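The yardstick method inverts the same geometry. Here is a minimal sketch (hypothetical function name). It assumes the target size and distance are in the same units and that the target fills the frame along the dimension you care about (usually the width). Comparing a few nearby distances shows why the advice above about locating the sensor matters.

```python
import math


def aov_from_target(target_size: float, distance: float) -> float:
    """AoV in degrees when a target of known size exactly fills the frame."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return math.degrees(2 * math.atan(target_size / (2 * distance)))


if __name__ == "__main__":
    # A 36-inch yardstick filling the frame, with the distance off by an inch either way.
    for d in (38, 39, 40):
        print(f"{d} in: {aov_from_target(36, d):.2f} deg")
```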
