This article is for people who want an overview of video transcoding.
Transcoding is a major pain point for online video companies and for distributors that deliver content to multiple devices. Video assets must be transcoded and delivered into a content management system before they can be published in a video player on a web site. With the proliferation of dozens of archival, encoding and consumption formats, the transcoding challenge has grown rapidly. Today, most media companies perform transcoding either by purchasing expensive software and server installations for onsite use or by outsourcing to a vendor. Other solutions, like mPoint’s MODTS, provide a higher-quality, more cost-efficient and scalable alternative. All of these approaches have different cost, quality, complexity and efficiency implications. Media compression is used to reduce transport time and storage costs in networks. The compression techniques are usually lossy, i.e. they involve some loss of video quality. These losses are cumulative: converting from one lossy format to another causes a progressive loss of quality with each successive conversion. Because of this, lossy-to-lossy transcoding is generally discouraged unless it is unavoidable. As with any emerging technology, many new applications, formats and methods of content delivery have been developed and are rapidly finding their places in the market. Opportunity exists in providing equipment that can combine video streams or change video from one format to another. For example, if an individual owns a digital audio player that does not support a particular format (e.g., an Apple iPod and Ogg Vorbis), then the only way for the owner to use content encoded in that format is to transcode it to a supported format. A better way is to retain a copy of the original content in a lossless format (like TTA, FLAC or WavPack) and then encode directly from the lossless source file to each supported lossy format. But such a solution is not risk free, as formats can quickly become obsolete.
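The cumulative nature of lossy conversion can be illustrated with a toy model. The sketch below is not a real codec: it assumes that each "format" is simply a quantizer with a different step size (the steps 7 and 5 and the sample data are arbitrary choices for illustration). It compares encoding directly from the untouched source with transcoding through an intermediate lossy format.

```python
def quantize(value: int, step: int) -> int:
    """Toy lossy 'encode': snap a sample to the nearest multiple of step."""
    return round(value / step) * step


def mean_abs_error(original, decoded):
    """Average absolute difference between source and decoded samples."""
    return sum(abs(o - d) for o, d in zip(original, decoded)) / len(original)


# Stand-in for samples kept in a lossless archive format.
source = list(range(100))

# Encode straight from the lossless source into the target format (step 5).
direct = [quantize(s, 5) for s in source]

# Encode into an intermediate lossy format (step 7), then transcode to step 5.
chained = [quantize(quantize(s, 7), 5) for s in source]

direct_error = mean_abs_error(source, direct)
chained_error = mean_abs_error(source, chained)
print(f"direct: {direct_error:.2f}  chained: {chained_error:.2f}")
```

In this model the direct encode always lands on the nearest value the target format can represent, so the chained path can match it at best and is worse whenever the intermediate format has already moved a sample. That is the argument for keeping a lossless master.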
The need of the hour is a video processing architecture that promises some longevity while addressing the needs of many emerging video communications applications. A variety of formats, such as MPEG-2, MPEG-4, H.263, H.264, VC-1 and Flash Video, can be used for creating and storing digital video content. There are significant differences among some of these standards, which affect frame rate and frame size as well as the level of content compression. Because video can be viewed on cell phones with varying screen sizes, and because the user may watch while moving or standing still, the network is compelled to compensate for these variables by changing the parameters of the original content. Adapting the media content so that it is compatible with another network (its communication links and access terminals), in order to deliver video with acceptable service quality, is thus an important issue. Video transcoding converts one compressed video bit-stream into another with a different format, size (spatial transcoding), bit rate (quality transcoding), or frame rate (temporal transcoding). The main objective of transcoding is to enable the interoperability of heterogeneous multimedia networks, ideally reducing complexity and run time by avoiding a complete decode and re-encode of the video stream. Transcoding is thus the direct digital-to-digital conversion from one (usually lossy) video format to another. In its basic, network-based form, the compressed video is first decoded (decompressed) into a raw intermediate format (such as PCM for audio or YUV for video), in a way that mimics standard playback of the lossy content, and this raw stream is then re-encoded (compressed) into the target format.
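The four kinds of transcoding named above can be told apart by comparing a source and a target stream description. The following is a minimal sketch; the `StreamProfile` class, its field names and the example values are assumptions made for illustration and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StreamProfile:
    """Hypothetical description of a compressed video stream."""
    codec: str
    width: int
    height: int
    bitrate_kbps: int
    frame_rate: float


def required_transcoding(src: StreamProfile, dst: StreamProfile) -> List[str]:
    """List which kinds of transcoding are needed to turn src into dst."""
    operations = []
    if src.codec != dst.codec:
        operations.append("format")
    if (src.width, src.height) != (dst.width, dst.height):
        operations.append("spatial")
    if src.bitrate_kbps != dst.bitrate_kbps:
        operations.append("quality")
    if src.frame_rate != dst.frame_rate:
        operations.append("temporal")
    return operations


# Illustrative values only: a broadcast-style source and a small handset target.
broadcast = StreamProfile("MPEG-2", 720, 576, 4000, 25.0)
handset = StreamProfile("H.263", 176, 144, 64, 15.0)
print(required_transcoding(broadcast, handset))
```

An empty result means the stream can be passed through unchanged, which is the cheapest case for the network.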
Another important network-based video processing operation is the combination of uncompressed video streams or images. This stream combination technique is typically used in applications such as conferencing and image/text overlay. Transcoding is found in many areas of content adaptation, but it is most commonly used for adapting content to mobile phones. Transcoding is a must in the world of mobile content because of the diversity of mobile devices. This diversity calls for an intermediate content adaptation stage to make sure that the source content is presented adequately on the target device it is sent to. MMS (Multimedia Messaging Service) is one of the most popular technologies in which transcoding is used. MMS is the technology for sending and receiving messages with media (image, sound, text and video) between mobile phones. For example, when you use a camera phone to take a digital picture, you are actually creating a high-resolution JPEG image, usually at least 640x480 with 24 bits of color. When the image is sent to another phone, however, it might be transcoded to a lower-resolution image with fewer colors in order to fit the target device's screen size and color limitations (e.g. 120x160 and 16 bits of color). This size and color reduction not only improves the user experience on the target device but is sometimes the only way for content to be sent between different mobile devices. There is a diverse range of end-user devices, whether a laptop, an ultramobile PC or a cellular phone, each supporting different content formats.
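The two adaptations in the MMS example, a smaller frame and fewer bits of color, can be sketched as follows. The function names are hypothetical, and RGB565 is assumed here as the 16-bit color layout; the article does not say which 16-bit format a given handset uses.

```python
def fit_within(src_w: int, src_h: int, max_w: int, max_h: int):
    """Scale a frame down to fit the target screen, keeping the aspect ratio."""
    scale = min(max_w / src_w, max_h / src_h, 1.0)  # never upscale
    return max(1, int(src_w * scale)), max(1, int(src_h * scale))


def rgb888_to_rgb565(r: int, g: int, b: int) -> int:
    """Reduce a 24-bit color (8 bits per channel) to a 16-bit RGB565 value."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


# Camera-phone picture from the example, sent to a 120x160 screen.
print(fit_within(640, 480, 120, 160))
print(hex(rgb888_to_rgb565(255, 255, 255)))
```

Because the aspect ratio is preserved, a landscape 640x480 picture does not fill a portrait 120x160 screen; a real adaptation service would also have to decide whether to crop, rotate or letterbox.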
A number of formats are currently available for creating and storing video, but an end-user device can display video only in a format that it supports. Content providers are therefore left with no option but either to store the same video in many different formats or to provide some method for converting the video to the format supported by the end-user device. One mechanism is for the network to extract a default format from the user-account profile and, if necessary, convert the content from its current format to one that the end-user device supports. A real challenge arises when running conferencing applications. In conferencing, the real-time communications involve multiple video streams, whose formats may all be incompatible with one another. Latency is introduced both when packets move across the network and by processing. Because most, if not all, of the acceptable latency budget is consumed by the movement of packets, keeping processing latency within acceptable limits becomes a stringent requirement. When content is streamed from a storage medium, sufficient buffering at the end point can significantly help mask packet-movement latency. Unfortunately, format conversion causes long processing delays, resulting in audio-synchronization issues and long start times that are intolerable for conference participants. End users may view a conference in different layouts depending on the application, for example only the loudest talker in one frame, or each participant in an array of frames. For viewing multiple frames, the participants' video streams must be processed and synchronized with the audio content, which is itself being decompressed, combined, compressed and packetized in a separate stream.
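The profile-based mechanism described above can be sketched as a small decision function: serve a stored copy when one already matches the user's default format, and otherwise transcode from a stored copy. The `UserProfile` class, the `plan_delivery` function and the rule of transcoding from the first stored variant are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class UserProfile:
    """Hypothetical user-account record holding the device's default format."""
    user_id: str
    default_format: str


def plan_delivery(stored_formats: List[str],
                  profile: UserProfile) -> Tuple[str, Optional[str]]:
    """Return (format_to_send, source_format_to_transcode_from_or_None)."""
    if not stored_formats:
        raise ValueError("no stored copy of the content is available")
    target = profile.default_format
    if target in stored_formats:
        return target, None           # a matching copy exists, no conversion
    return target, stored_formats[0]  # assumed rule: convert the first variant


profile = UserProfile(user_id="alice", default_format="H.263")
print(plan_delivery(["MPEG-2", "H.264"], profile))
print(plan_delivery(["MPEG-2", "H.263"], profile))
```

Storing more variants makes the first branch more likely and saves processing, at the price of storage. This is the trade-off content providers face between the two options in the text.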
But analysis and synchronization of the audio stream are required even in a simple “loudest talker” conference, which involves switching video feeds rather than combining them. Technological advances have also improved audio quality. Current audio technologies offer more bandwidth for better fidelity through wideband coders, with spatial resolution that lets a high-quality speaker system provide directional sound that mimics the natural environment of a face-to-face meeting. Image and text overlay applications are required for various functions, such as creating special effects, helping deaf viewers interpret content and helping to ensure digital rights management. In such applications the video content must first be converted to its uncompressed (raw) format; the additional content (such as text messages or news scrolling along the bottom of the screen) is then combined with it before the result is compressed and packetized. Fortunately such content, unlike conferencing, is often streamed from a storage endpoint, so latency is generally not an issue. The processing power required for transcoding or for combining an image is a function of the number of bits processed, which in turn depends on the overall image quality required. Compressing and decompressing an H.263 Quarter Common Intermediate Format (QCIF) stream requires as much as 200 times less processing power than doing the same for an HD stream in H.264 format. A solution provider investing in video transcoding needs a solution, such as mPoint’s MODTS, that scales easily from the initial service introduction of a few hundred simultaneous channels to a more mature deployment of a few thousand channels. In addition, it is almost certain that the compression-format standard will change over the lifetime of a service, so flexibility in a transcoding solution is a must.
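A rough feel for the gap between a QCIF stream and an HD stream comes from comparing raw pixel rates. The calculation below is a back-of-the-envelope sketch: the resolutions are the standard QCIF (176x144) and 1080-line HD (1920x1080) sizes, but the frame rates of 15 and 30 frames per second are assumptions, and codec complexity is ignored entirely.

```python
def pixel_rate(width: int, height: int, frames_per_second: int) -> int:
    """Pixels that must be processed per second for one stream."""
    return width * height * frames_per_second


qcif = pixel_rate(176, 144, 15)    # assumed 15 fps for a low-rate QCIF call
hd = pixel_rate(1920, 1080, 30)    # assumed 30 fps for an HD stream

ratio = hd / qcif
print(f"QCIF: {qcif} px/s, HD: {hd} px/s, ratio: {ratio:.1f}x")
```

Under these assumptions the pixel rate alone differs by a factor of roughly 164. H.264 is also a more demanding algorithm than H.263, so an overall gap on the order of the 200 times quoted in the text is plausible.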
To support this point, it is worth noting that the ITU-T Video Coding Experts Group has just launched its H.265 investigation. Considering the above requirements, four challenges can be identified that equipment designers must address in devising a solution:
From the design requirements listed above we can define four key design considerations that are vital for video processing architectures: scalability, versatility, density and programmability. A transcoding solution must be scalable in two dimensions: the number of channels and the amount of processing power per channel. Another important consideration is that the complexity of an encode-compression scheme for video can be two to ten times that of the decode-decompression scheme, depending on the type and complexity of the coders used. In fact, the asymmetry in processing power when transcoding video from one format to another (for example, from H.263 to MPEG-4) can cause an additional five- to ten-fold variation. In a conferencing solution, the amount of processing power required for a single conference depends on the total number of participants at one time. Two or three processors are required for large video conferences at low resolution, whereas moderately sized conferences at high-definition resolution can involve ten or more processors. From a design perspective, one way to approximate the best price/performance curve is to create a set of separate hardware designs, each optimized for different capacities or algorithms. Unfortunately, this approach means maintaining numerous designs, perhaps with significantly different components and, worst of all, different code bases. The challenge is to establish a common code base and, if possible, a common modular hardware design for all requirements. A modular design typically involves replicating a processor, with 1 to N instances of the same processor on some extensible fabric; before rushing directly into design, however, one must consider the other implications of this solution.
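The two scaling dimensions and the encode/decode asymmetry can be combined into a rough sizing estimate. The sketch below is an assumption-laden model, not a vendor formula: costs are expressed in arbitrary units, each channel is taken to need one decode plus one encode, and `processor_capacity` is a hypothetical figure for how many units one processor can deliver.

```python
import math


def processors_needed(channels: int, decode_cost: float,
                      encode_ratio: float, processor_capacity: float) -> int:
    """Estimate processors for full transcoding (decode plus encode) channels.

    encode_ratio is how many times costlier encoding is than decoding
    (the text gives a range of two to ten).
    """
    per_channel = decode_cost * (1 + encode_ratio)
    return math.ceil(channels * per_channel / processor_capacity)


# 300 channels, decode cost of 1 unit, processors delivering 100 units each.
low = processors_needed(300, 1, 2, 100)    # cheapest encoder in the range
high = processors_needed(300, 1, 10, 100)  # costliest encoder in the range
print(low, high)
```

With these illustrative numbers the same channel count needs between 9 and 33 processors depending only on the encoder, which is why a design that scales in processor count without a change of code base is attractive.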
A versatile platform is required to cope with new variations on old algorithms, completely new algorithms and varying demands on algorithmic instances (which we refer to as “algorithm volatility”). Such a platform helps a manufacturer maintain a market-leadership position by quickly introducing new algorithms or new features that differentiate the product line. But versatility is overshadowed by our second requirement: heavy-duty processing power. Designs attain longevity through versatility, so some level of general-purpose functionality is required. However, because considerable processing power is needed along with versatility, designers must consider a mix of processor types. In terms of efficiency, a general-purpose processor (such as an x86) is normally no match for a DSP (Digital Signal Processor). As a result, for a given algorithm, a general-purpose architecture cannot deliver as many channels as a DSP can in the same space and with the same electrical power. Since the maximum number of channels determines the overall dynamic range of a product offering, an industry-leading design must achieve the best possible density in order to reach the lowest cost, size and power. Although manufacturers have made some strides in giving DSPs more design flexibility, DSPs still lack the overall versatility of general-purpose CPUs. Many DSP manufacturers provide algorithmic solutions with optimized code that can be licensed for a fee. The advantage of this approach is that it saves considerable time to market and greatly reduces programming risk.
The disadvantage is that licensed algorithmic solutions tie the hands of integrators, eliminating their ability to provide major differentiating value in their products and to roll out new product features. Integrators must have their own programmers to create the code base for the processing architecture if they want to add differentiating value in a timely way (and support their products at the high level required by network service providers). To reduce time and risk further, those programmers must have a solid set of tools in a development environment with which they are familiar. Some of the new low-level solutions, which prepackage DSPs (some with algorithms) on an industry-standard board such as AdvancedTCA or AMC, pose several problems that the issues discussed here highlight. Although they offer certain undeniable time-to-market advantages, their restrictions must also be considered. These new solutions are modular, but only with respect to the features and performance provided by the vendor, and only in the vendor's size increments. Such an approach not only compromises the integrators’ ability to differentiate their products; these predigested solutions also often lack the communications fabric needed for scaling in the required dimensions, as well as the general-purpose processing required for industry-leading versatility. A better approach is to use a balance of general-purpose processors and tightly coupled accelerators. A variety of accelerators, such as DSPs, ASICs or processing arrays, would be most appropriate. The important aspects of this kind of design strategy are the overlying software structure, the algorithmic partitioning and the overall communications fabric. A general-purpose CPU runs the overlying software structure, and the structure must abstract both the type of algorithm used and the acceleration device that is processing the algorithm.
The advantage of such an approach is that new algorithms and acceleration technologies can be introduced quickly without affecting the application. The application itself can run on the general-purpose CPU or on a remote server via remote media- and call-control protocols (SIP, MSML, VXML, H.248, etc.). A general-purpose CPU not only handles local control, management and data routing but can also be used for introducing new algorithms. The objective of algorithmic partitioning is to ensure that the accelerators run the older, more stable algorithms and the most processor-intensive operations. It must also be kept in mind that the accelerator need not run the entire algorithm. Generally speaking, in a media compression algorithm about 70 percent of the processing power required for the entire algorithm is consumed by only 10 to 30 percent of the tasks. Such media-compression tasks call for offload acceleration, but this offloading needs tight coupling between the general-purpose CPU and the accelerator. Finally, an easily scalable overall communications fabric is required that provides sufficient performance for function-offload partitioning. The fabric must support the bandwidth of HD video channels. It must also support the bandwidth for the routing and switching needed to use multiple processors efficiently for a single conference while avoiding unacceptable latency. The fabric must also provide an internal protocol for media-stream routing with very low processing overhead (such as iTDM for local chassis communications or a pseudo-wire protocol for multichassis solutions). By using the right mix of complementary processors and by intelligently partitioning the processing load, one can create a video processing solution that is both versatile and scalable across any channel density and algorithmic complexity.
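The partitioning rule of thumb above, that a minority of tasks consumes about 70 percent of the processing power, suggests a simple way to pick offload candidates from a profile: take the costliest tasks until the target share is covered. The sketch below is illustrative only. The task names and cost figures are invented for the example and are not measurements of any codec.

```python
from typing import Dict, List, Tuple


def select_offload_tasks(task_costs: Dict[str, float],
                         target_share: float = 0.70) -> Tuple[List[str], float]:
    """Pick the costliest tasks until they cover target_share of total cost."""
    total = sum(task_costs.values())
    selected, covered = [], 0.0
    ranked = sorted(task_costs.items(), key=lambda item: item[1], reverse=True)
    for name, cost in ranked:
        if covered / total >= target_share:
            break
        selected.append(name)
        covered += cost
    return selected, covered / total


# Hypothetical profile of an encoder, in percent of total processing time.
profile = {
    "motion_estimation": 45, "transform_quant": 27, "entropy_coding": 8,
    "deblocking": 6, "rate_control": 4, "bitstream_packing": 3,
    "mode_decision": 3, "scene_detection": 2, "header_parsing": 1,
    "buffer_management": 1,
}
offload, share = select_offload_tasks(profile)
print(offload, f"{share:.0%}")
```

In this made-up profile two of the ten tasks account for 72 percent of the load, so only those two would go to the accelerator. Everything else stays on the general-purpose CPU, where it is easier to change.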
A variety of video formats can be implemented with relative ease on PCs and in consumer electronics equipment. This has made it possible to offer multiple codecs in the same product, eliminating the need to choose a single dominant format for compatibility reasons. Considering the wide variety of video formats and the ease with which they are implemented, it seems very unlikely that one codec or video format will replace them all. Some of the widely used video formats (mentioned in the text above) are listed below:

H.261: H.261 was developed by the ITU-T and was the first practical digital video compression standard. It is used primarily in older video conferencing and video telephony products. All subsequent standard video codec designs are essentially based on it. H.261 supported only progressive scan video. It introduced a number of concepts that are now well established, such as YCbCr color representation, the 4:2:0 sampling format, 8-bit sample precision, 16x16 macroblocks, block-wise motion compensation, the 8x8 block-wise discrete cosine transform, zig-zag coefficient scanning, scalar quantization, run+value symbol mapping, and variable-length coding.

MPEG-1 Part 2: This standard is mainly used for VCD (Video Compact Disc), though it is sometimes also used for online video. VCD can look slightly better than VHS provided that the source video quality is good and the bit rate is high enough. A higher resolution is necessary if the quality is to be clearly better than VHS. However, to keep a file fully VCD compliant, bit rates higher than 1150 kbit/s and resolutions higher than 352 x 288 should not be used. VCD enjoys the highest compatibility of any digital video/audio system. Very few DVD players do not support VCD, and in any case they all inherently support the MPEG-1 codec. Virtually every computer can also play video encoded with this codec.
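The VCD limits quoted above (MPEG-1 video, at most 1150 kbit/s and at most 352 x 288) lend themselves to a simple pre-flight check before encoding. This is a minimal sketch using only those figures; the function name is hypothetical, and the full VCD specification contains further constraints that are not modeled here.

```python
from typing import List

# Limits taken from the text above.
VCD_MAX_BITRATE_KBPS = 1150
VCD_MAX_WIDTH, VCD_MAX_HEIGHT = 352, 288


def vcd_violations(codec: str, width: int, height: int,
                   bitrate_kbps: int) -> List[str]:
    """Return the reasons a stream would not be VCD compliant (empty if none)."""
    problems = []
    if codec != "MPEG-1":
        problems.append("codec must be MPEG-1")
    if width > VCD_MAX_WIDTH or height > VCD_MAX_HEIGHT:
        problems.append("resolution exceeds 352x288")
    if bitrate_kbps > VCD_MAX_BITRATE_KBPS:
        problems.append("bit rate exceeds 1150 kbit/s")
    return problems


print(vcd_violations("MPEG-1", 352, 288, 1150))
print(vcd_violations("MPEG-1", 480, 576, 2500))
```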
Technically speaking, the most significant enhancements in MPEG-1 relative to H.261 were support for half-pel and bi-predictive motion compensation. Like H.261, MPEG-1 supports only progressive scan video.

MPEG-2 Part 2 (a common-text standard with H.262): This standard is used on DVD and SVCD and in most digital video broadcasting and cable distribution systems. It offers good picture quality and supports widescreen when used on a standard DVD. On SVCD it is not as good as on DVD, but it is definitely better than VCD because of the higher resolution and allowed bit rate. Though this is uncommon, MPEG-1 can also be used on SVCDs and anywhere else MPEG-2 is allowed, because MPEG-2 decoders are inherently backwards compatible. In terms of technical design, the most significant enhancement in MPEG-2 relative to MPEG-1 was the addition of support for interlaced video. MPEG-2 is now considered an aged codec, but it has tremendous market acceptance and a very large installed base.

H.263: This standard is used primarily for video conferencing, video telephony and internet video. H.263 represented a significant step forward in standardized compression capability for progressive scan video. It substantially reduced the bit rate needed to attain a given level of fidelity, especially at low bit rates.

MPEG-4 Part 2: This is an MPEG standard that can be used for internet, broadcast and storage media. It offers improved quality relative to MPEG-2 and the first version of H.263. It also included some enhancements of compression capability, both by embracing capabilities developed in H.263 and by adding new ones such as quarter-pel motion compensation. Like MPEG-2, it supports both progressive scan and interlaced video.

MPEG-4 Part 10 (H.264, often also referred to as AVC): This emerging standard is the current state of the art of ITU-T and MPEG standardized compression technology, and is rapidly gaining adoption in a wide variety of applications.
It contains a number of significant advances in compression capability, and it has recently been adopted in a number of products, including the Xbox 360, PlayStation Portable, iPod, the Nero Digital product suite and Mac OS X v10.4, as well as HD DVD and Blu-ray Disc.

DivX, Xvid, FFmpeg MPEG-4 and 3ivx: Different implementations of MPEG-4 Part 2.

WMV (Windows Media Video): Microsoft's family of video codec designs, including WMV 7, WMV 8 and WMV 9. It can handle anything from low-resolution video for dial-up internet users to HDTV. The latest generation of WMV is standardized by SMPTE as the VC-1 standard.

VC-1: A video compression standard standardized by SMPTE (SMPTE 421M) and based on Microsoft's WMV9 video codec. It is one of the three mandatory video codecs in both the HD DVD and Blu-ray high-definition optical disc standards. It is commonly found in portable devices and on streaming video websites in its Windows Media Video implementation.

RealVideo: This format was developed by RealNetworks. It was a popular codec technology a few years ago but is now fading in importance for a variety of reasons.

x264: A GPL-licensed implementation of the H.264 encoding standard. x264 is only an encoder.

Huffyuv: Huffyuv (or HuffYUV) is a very fast, lossless Win32 video codec written by Ben Rudiak-Gould and published as free software under the terms of the GPL. It is meant to replace uncompressed YCbCr as a video capture format.

VLC (VideoLAN Client) is a media player capable of playing various multimedia formats, drawing especially on the decoders of the FFmpeg project and making use of some original Windows codecs. Furthermore, VLC is very effective at handling streamed multimedia content: it can play streams sent by streaming servers, by another VLC instance or even by hardware encoders. VLC can handle streams transmitted using RTP or bare UDP over unicast as well as multicast networks.
These features are ideal for setups in which VLC is responsible for unpacking the multimedia data from a UDP stream, transcoding the data and repacking it into a UDP stream.

mPoint’s MODTS solution has been the most significant recent breakthrough in the transcoding space. The mPoint On-Demand Transcoding Solution (MODTS™) is a cost-effective transcoding solution that ingests audio or video content in any file format, encodes it into any number of derivative formats, and delivers the media files to servers or content delivery networks for consumption within minutes. No hardware or software footprint is required, and pricing is usage based. MODTS™ utilizes a patent-pending system for dynamically creating processing system nodes and releasing them when they are finished. Capacity scales according to the number of transcoding jobs requested and the priority granted to each. Once content is ingested, the Smart Queue prioritizes the job and determines which processing nodes must be provisioned to process the batch in the shortest possible time. MODTS™ scales storage, request rate and users to support an unlimited number of transcodes. MODTS™ is highly reliable. It stores data durably with 99.99% availability. There are no single points of failure. All failures are tolerated or repaired by the system without any downtime, and replacement instances can be rapidly and reliably commissioned. MODTS™ performance has been tuned to support high-performance media publishing applications. Server-side latency is insignificant relative to Internet latency. Any performance bottlenecks are cleared simply by adding nodes to the system. Proprietary, decentralized load-balancing algorithms are used to remove scaling bottlenecks and single points of failure, so that transcoding throughput increases as the transcoding request load increases. MODTS™ is simple to use.
Through the MODTS™ dashboard, users can create specific workflow channels, or profiles, which specify how an ingested source file is to be transcoded into the acceptable formats and delivered to the correct CMS, CDN, partner and/or hosting platform. The user can then sit back and view the real-time dynamic allocation, queue status, batch prioritization and other statistics about the current state of the system and their account. The MODTS™ security system allows the user to place running instances into groups and to specify which communication is permitted between those groups. It also allows the user to specify which IP subnets on the Internet may talk to these groups. This provides access control over processing and storage instances in a highly dynamic environment. In addition, every instance is protected by an extra layer of Linux host security.
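The group-based access control described above can be pictured with a small model built on Python's standard `ipaddress` module. This is a generic sketch of the idea only: the `SecurityGroup` class, its methods and the example subnet are hypothetical and say nothing about how MODTS™ itself is implemented.

```python
import ipaddress


class SecurityGroup:
    """Hypothetical group of instances with an allow-list of peers."""

    def __init__(self, name, allowed_subnets=(), allowed_groups=()):
        self.name = name
        # Subnets on the Internet that may talk to instances in this group.
        self.allowed_subnets = [ipaddress.ip_network(s) for s in allowed_subnets]
        # Names of other groups whose instances may talk to this group.
        self.allowed_groups = set(allowed_groups)

    def permits_ip(self, address: str) -> bool:
        """True if the address falls inside any allowed subnet."""
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self.allowed_subnets)

    def permits_group(self, other_group_name: str) -> bool:
        """True if instances of the named group may connect."""
        return other_group_name in self.allowed_groups


# Example: storage accepts traffic only from transcoders and one office subnet.
storage = SecurityGroup("storage",
                        allowed_subnets=["203.0.113.0/24"],
                        allowed_groups=["transcoders"])
print(storage.permits_ip("203.0.113.7"), storage.permits_ip("198.51.100.1"))
print(storage.permits_group("transcoders"), storage.permits_group("web"))
```

Because rules refer to group names rather than to individual machines, instances can be created and released dynamically without the rules being rewritten.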