Background#
In 2019, Zhixun (Accsoon) released the first generation of its Yingmou wireless video transmitter. In an era of rapid growth for self-media, camera accessories of every kind have appeared in a surge: three-axis gimbals, video transmitters, microphones, and more. With manufacturers large and small piling in, plenty of hastily developed, rough products were inevitable. As Zhixun's first video transmission product, Yingmou does show some small-workshop traits, but more importantly it gets a lot right. The design is unremarkable yet very practical, and the user experience is quite good: latency is low enough to pull focus while watching the feed, and it runs for 3-4 hours without external power. Combined with its launch price of 799 yuan and a second-hand price that has since dropped below 300 yuan, it remains a practical device to this day.
In terms of transmission scheme, Yingmou is a Wi-Fi transmitter and ships without a dedicated receiver. The camera or other HDMI source connects to the transmitter, and reception is done by connecting a mobile device to the transmitter's Wi-Fi and viewing the image remotely in a companion app. Wi-Fi is a fitting choice for low-cost transmission: off-the-shelf Wi-Fi cards are far cheaper and easier to build on than dedicated wireless solutions, and apart from Wi-Fi there are few low-cost options that can sustain the 5 Mbps or more needed for a 1080P video stream. For a budget product, omitting a dedicated receiver cuts nearly half of the hardware cost and lets the existing screens of phones and tablets serve as monitors, so no extra display has to be bought. This lowers the barrier to purchase and use and makes Yingmou well suited to individuals and small teams, at the cost of giving up the ability to feed large monitors and broadcast stations.
My initial research on Yingmou was aimed at understanding how it achieves a good Wi-Fi transmission experience. Once I could correctly parse the data stream it sends, I began trying to build a "custom" receiving end for it, to work around the fact that, lacking a receiver, it cannot feed large screens or be used for live streaming. I implemented two prototypes: a hardware receiver based on the Allwinner D1, and an OBS Studio plugin. Along the way I found that Yingmou was an easy target, like "picking a soft persimmon": the Hisilicon hardware platform, the Wi-Fi transmission method, and the rudimentary early version of the receiving app all offered plenty of angles to work with, making it a perfect opportunity to learn and practice. What follows is a record of the whole process.
Hardware#
After removing the screws on the front, the Yingmou shell can be opened to reveal the hardware solution. Below are the main chips and basic information:
- Main Chip: Hisilicon Hi3516ARBCV100 SoC
- Interface Chip: ITE IT6801FN HDMI-to-BT.1120 bridge
- RAM: Hynix H5TC2G63GFR DDR3L
- Flash: Macronix MX25L12835F 16MB SPI Flash
- Network Card: Olink 8121N-UH module, Qualcomm Atheros AR1021X chip, 2x2 802.11 a/n 5G, USB 2.0 interface
- MCU: ST STM32F030C8T6
The main chip, the Hisilicon Hi3516A, has a single-core Cortex-A7 processor, an official SDK based on the Linux 3.4 kernel, and, crucially, an H.264/H.265 hardware video codec. Officially positioned as "a professional HD IP camera SoC with an integrated new-generation ISP," it is indeed widely used in surveillance, which reveals the essence of low-cost video transmission: take a surveillance design, replace the image sensor input with HDMI input, and you have a transmission device. Surveillance solutions ship in huge volumes, and the SoC omits peripherals that surveillance does not need, such as display output, which keeps costs very low. The only slightly unusual problem is converting the HDMI input into an interface the SoC natively accepts, such as MIPI CSI or BT.1120, but off-the-shelf ICs exist for that. HDMI encoders built on Hisilicon platforms take a similar approach; unlike Wi-Fi transmitters, which need an external Wi-Fi card, they can use the SoC's built-in GMAC and only need an Ethernet PHY for a wired connection, capturing HDMI and live streaming stably.
What shocked me most about Yingmou's hardware is its mere 16MB of Flash: a whole Linux device fitting in 16MB of storage. On reflection, though, many routers running OpenWRT get by with 16MB or even 4MB of Flash; in video processing, the hunger for space is mainly in RAM. As long as you are willing to forgo the rich package ecosystem of distributions like Debian and trim things down, a Linux dedicated to a specific task can occupy very little space.
Board Environment#
Boot#
Since it uses a Hisilicon chip, development basically follows the official SDK. After soldering wires to the three pads on the board marked R, T, and G and powering it on, the U-Boot and HiLinux boot logs appear, confirming that this is indeed a stock Hisilicon setup.
Executing `printenv` under U-Boot retrieves the command used to boot the kernel and the parameters passed to it. From the output, the SPI Flash layout is 1M (boot), 3M (kernel), and 12M (rootfs), with rootfs being a 12MB jffs2 file system. The boot process first probes the SPI Flash (`sf`) device to obtain Flash information, then reads the 0x300000 (3MB) kernel from offset 0x100000 (1MB) in Flash into memory at 0x82000000, and finally starts the kernel from memory with the `bootm` command.
```
bootfile="uImage"
bootcmd=sf probe 0;sf read 0x82000000 0x100000 0x300000;bootm 0x82000000
bootargs=mem=128M console=ttyAMA0,115200 root=/dev/mtdblock2 rootfstype=jffs2 mtdparts=hi_sfc:1M(boot),3M(kernel),12M(rootfs)
```
After System Boot#
After entering the system, how do we find the transmission program running on the board? According to the Hisilicon development environment user guide, programs that should run automatically at system startup are added to `/etc/init.d/rcS`, so I opened `/etc/init.d/rcS` to check.
The main content of `rcS` is as follows.
First, the kernel network buffer is modified, with the write buffer set to 0x200000 (2MB) and the read buffer set to 0x80000 (512KB):
```
#sys conf
sysctl -w net.core.wmem_max=2097152
sysctl -w net.core.wmem_default=2097152
sysctl -w net.core.rmem_max=524288
sysctl -w net.core.rmem_default=524288
```
Loading the wireless network card driver:
```
insmod /ko/wifi/ath6kl/compat.ko
insmod /ko/wifi/ath6kl/cfg80211.ko
insmod /ko/wifi/ath6kl/ath6kl_usb.ko reg_domain=0x8349
```
IP/DHCP configuration:
```
ifconfig wlan0 10.0.0.1 netmask 255.255.255.0 up
echo udhcpd
udhcpd /etc/wifi/udhcpd.conf &
#echo hostapd
#hostapd /etc/wifi/hostap.conf &
```
Loading the MPP driver, consistent with the SDK documentation:
```
cd /ko
./load3516a -i -sensor bt1120 -osmem 128 -online
```
Loading MPP mainly performs initialization and loads many kernel modules; the output log is as follows:
```
Hisilicon Media Memory Zone Manager
Module himedia: init ok
hi3516a_base: module license 'Proprietary' taints kernel.
Disabling lock debugging due to kernel taint
load sys.ko for Hi3516A...OK!
load tde.ko ...OK!
load region.ko ....OK!
load vgs.ko for Hi3516A...OK!
ISP Mod init!
load viu.ko for Hi3516A...OK!
load vpss.ko ....OK!
load vou.ko ....OK!
load hifb.ko OK!
load rc.ko for Hi3516A...OK!
load venc.ko for Hi3516A...OK!
load chnl.ko for Hi3516A...OK!
load h264e.ko for Hi3516A...OK!
load h265e.ko for Hi3516A...OK!
load jpege.ko for Hi3516A...OK!
load vda.ko ....OK!
load ive.ko for Hi3516A...OK!
==== Your input Sensor type is bt1120 ====
acodec inited!
insert audio
==== Your input Sensor type is bt1120 ====
mipi_init
init phy power successful!
load hi_mipi driver successful!
```
After this, a program named `RtMonitor` runs, and all of the transmission's business logic is implemented within it.
The exploration of the board environment went almost suspiciously smoothly: no obstacles were encountered, and debugging comments were even left in the script. The Hisilicon SDK documentation actually describes protective measures for each of these stages, such as disabling the serial console and setting a root password; applying any of them would have created considerable trouble.
Transmission#
Packet Capture#
First, I tried capturing packets over Wi-Fi to see what data is being transmitted. Since iOS apps can be installed on ARM-based Macs, I could run the Accsoon app and Wireshark on the same machine and capture the traffic. Three types of packets can be observed:
- Transmission → Receiver UDP: Large data volume, presumed to be the transmission data stream;
- Receiver → Transmission UDP: Very short, presumed to be data acknowledgment packets;
- Receiver → Transmission TCP: Sent when opening the transmission interface, triggering the aforementioned UDP transmission, and then approximately every 0.5-1 seconds, presumed to be heartbeat keep-alive packets, with content containing "ACCSOON".
Additionally, I performed a capture in monitor mode. It can be observed that when multiple devices are connected, due to the lack of high-speed multicast/broadcast mechanisms in Wi-Fi, data needs to be sent separately to each device, significantly increasing channel pressure.
Another disadvantage of Wi-Fi transmission is that, as long as the 802.11 protocol is followed and the interframe spacing and random backoff are not modified to gain an unfair advantage in channel contention, the transmitter has no higher priority than any other Wi-Fi device. When many other Wi-Fi devices are active on the channel, stuttering is inevitable. Fortunately, congestion on 5GHz channels is generally milder than on 2.4GHz.
Decompiling Android APK#
Packet capture alone makes it hard to see the specific contents of the packets, especially the meaning of the various header fields, so I turned to analyzing the logic of Zhixun's Android app. Newer versions add a lot of code to support other devices, so an older version is easier to analyze; I downloaded an old version that supports the Yingmou transmitter from apkpure (Accsoon 1.2.5 Android APK). Decompiling the APK with Jadx, I mainly looked for the following:
- Composition of UDP video stream data packets, to correctly parse the video stream;
- Content and sending logic of TCP control commands, to correctly trigger the device to start sending functions.
Analysis of the key logic in the Java code:
- MediaCodecUtil class
  - Encapsulates operations on Android's native codec interface `MediaCodec`.
  - In the constructor, `MediaCodec` is initialized; from the initialization parameters it can be inferred that the decoder is "video/avc", i.e. the transmitted video stream is H.264 encoded.
  - During initialization, `MediaCodec.configure` is passed a `Surface`, and `MediaCodec` outputs decoded video frames directly to that `Surface`'s `BufferQueue`, calling back `onFrameAvailable()`.
  - The `putDataToInputBuffer` method handles `MediaCodec`'s input side: it requests an empty buffer from the buffer queue, copies the data to be decoded into it, and places it into the input buffer queue.
  - The `renderOutputBuffer` method handles `MediaCodec`'s output side: it retrieves the decoded data from the output buffer queue and then releases that buffer.
- MediaServerClass class
  - The `Start()` method calls `MediaRtms.Start()` and `TcpLinkClass.StartMediaStream()`, starting the UDP and TCP sides respectively. `H264FrameReceiveHandle` is passed as a callback when instantiating `MediaRtms`; when it is invoked, it ultimately calls `putDataToInputBuffer` and `renderOutputBuffer` in `MediaCodecUtil`.
- MediaRtms class
  - A simple wrapper around the `rtmsBase` class.
  - The `Start()` method creates a `DatagramSocket` and starts a `udpRxThread` thread, which continuously receives data; after receiving data of a certain length, it parses the packet header, and if it is video it invokes the `H264FrameReceiveHandle` callback.
- TcpLinkClass class
  - After `StartMediaStream()` is called, a `KeepAliveThread` thread is started, which every second calls a `TcpLinkClass` method named `StaOp` that connects over TCP, sends a heartbeat packet, and disconnects.
- SurfaceRender class
  - The video is displayed on a `GLSurfaceView` control. In `VideoMainActivity`, `setRenderer` is called to set `SurfaceRender` as the renderer for the `GLSurfaceView`.
  - The `onSurfaceCreated` method creates a `SurfaceTexture` (`mSurfaceTexture`) bound to an OpenGL texture (`mTextureId`) to receive the video frames decoded by `MediaCodec`. It also creates the framebuffer object and texture needed for off-screen rendering, in preparation for effect processing.
  - The `onDrawFrame` method draws the current frame. It calls `updateTexImage` to update the latest frame from the `SurfaceTexture` into the bound OpenGL texture. It then switches to off-screen rendering and uses a shader program to combine the video frame texture with the LUT texture, applying the 3D LUT; it then switches back to on-screen rendering, using the texture produced off-screen to implement effects such as zebra stripes and black-and-white, and displays the result. Overlays such as center markers and aspect-ratio frames are drawn separately at the end.
Let's take a closer look at how long the packet header is and what information it contains.
TCP Frame:
UDP Frame:
Each Message carries one frame of the encoded stream and is preceded by a Message Header:
Each Message is split into several Frame segments for sending, and each Frame is preceded by a Frame Header:
H.264 Stream Extraction#
Knowing the structure of the data packets, we can start parsing. After reassembling each Message from its Frame segments, the Message content turns out to begin with the fixed prefix `0x000001`, the hallmark of a NALU (Network Abstraction Layer Unit), followed by a one-byte NALU Header whose key field is `nal_unit_type`, which identifies the type of content in the payload. In theory, all that remains is to feed the Message contents to the decoder one by one to decode the video stream. Note, however, that the decoder relies on parameters stored in the SPS and PPS, such as profile, level, width, height, and deblocking filter settings, which must reach the decoder before the first I-frame. It is therefore best to check `nal_unit_type` and make sure the SPS and PPS are sent to the decoder first.
NAL Header:
NALU Type:
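As an illustration of the gating just described, here is a minimal C sketch (an illustrative helper, not the actual receiver code) that reads `nal_unit_type` from a reassembled Message and withholds data from the decoder until both SPS and PPS have been seen:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* nal_unit_type values from the H.264 spec (Table 7-1) */
#define NAL_SPS 7
#define NAL_PPS 8

/* Extract nal_unit_type from a Message that starts with the
 * 0x000001 start code followed by the one-byte NALU header. */
static int nalu_type(const uint8_t *msg, size_t len)
{
    if (len < 4 || msg[0] != 0x00 || msg[1] != 0x00 || msg[2] != 0x01)
        return -1;                 /* not a valid Annex-B NALU */
    return msg[3] & 0x1F;          /* lower 5 bits of the NALU header */
}

/* Returns true once Messages may be handed to the decoder:
 * everything before the first SPS/PPS pair is dropped. */
static bool ready_for_decoder(const uint8_t *msg, size_t len)
{
    static bool got_sps = false, got_pps = false;
    int type = nalu_type(msg, len);

    if (type == NAL_SPS) got_sps = true;
    if (type == NAL_PPS) got_pps = true;
    return got_sps && got_pps;
}
```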
Receiver Design#
Receiving Solution 1 - Computer Reception#
With the packet structure clear, as long as the data packets are correctly received, the H.264 stream can be extracted and sent to a decoder. To allow efficient development and debugging, I started on a computer, where decoding is easy with a multimedia framework such as FFmpeg (the libav* libraries) or GStreamer. I tried FFmpeg first, which mainly requires the following steps:
- Decoder Initialization: Use `avcodec_find_decoder()` to find the H.264 decoder, `avcodec_alloc_context3()` to create a context, and `avcodec_open2()` to open the decoder.
- Data Decoding: Use `av_packet_from_data()` to wrap the data in an `AVPacket`, then `avcodec_send_packet()` to send it to the decoder, and `avcodec_receive_frame()` to retrieve the decoded `AVFrame`.
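Put together, the calls above look roughly like the following sketch (error handling omitted; the helper names are illustrative, not taken from the actual program):

```c
#include <libavcodec/avcodec.h>

/* One-time decoder setup, following the steps listed above. */
static AVCodecContext *init_h264_decoder(void)
{
    const AVCodec *codec = avcodec_find_decoder(AV_CODEC_ID_H264);
    AVCodecContext *ctx  = avcodec_alloc_context3(codec);
    avcodec_open2(ctx, codec, NULL);
    return ctx;
}

/* Decode one reassembled Message. `data` must be av_malloc'ed with
 * AV_INPUT_BUFFER_PADDING_SIZE extra bytes, since av_packet_from_data()
 * takes ownership of it. Decoded pictures land in `frame`. */
static void decode_message(AVCodecContext *ctx, uint8_t *data, int size,
                           AVFrame *frame)
{
    AVPacket *pkt = av_packet_alloc();
    av_packet_from_data(pkt, data, size);

    avcodec_send_packet(ctx, pkt);
    while (avcodec_receive_frame(ctx, frame) == 0) {
        /* frame->data / frame->linesize now hold a decoded YUV picture,
         * ready for the SDL display path. */
    }
    av_packet_free(&pkt);
}
```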
The overall logic of the program is roughly as follows:
- Main Thread: Initializes the FFmpeg decoder and SDL display, starts UDP and TCP threads. It then loops waiting for available data signals, decodes the data, and displays it.
- UDP Thread: Receives packets, collects all the segments belonging to each `msg_id`, combines them into a complete message, places the content in shared memory, and signals the main thread.
- TCP Thread: Periodically sends heartbeat packets.
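The hand-off between the UDP thread and the main thread can be sketched with a plain mutex and condition variable, as below. This is an illustrative sketch, not the actual code: the real program reassembles segments by `msg_id` before publishing, and the buffer size here is arbitrary.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hand-off buffer shared between the UDP thread and the main thread. */
static struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    uint8_t         msg[1024 * 1024];  /* one reassembled Message */
    size_t          len;
    int             has_data;
} shared = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

/* UDP thread: once every segment of a Message has arrived,
 * publish it and wake the main thread. */
static void publish_message(const uint8_t *msg, size_t len)
{
    pthread_mutex_lock(&shared.lock);
    memcpy(shared.msg, msg, len);
    shared.len = len;
    shared.has_data = 1;
    pthread_cond_signal(&shared.ready);
    pthread_mutex_unlock(&shared.lock);
}

/* Main thread: block until a Message is available, then decode it. */
static size_t wait_for_message(uint8_t *out)
{
    pthread_mutex_lock(&shared.lock);
    while (!shared.has_data)
        pthread_cond_wait(&shared.ready, &shared.lock);
    memcpy(out, shared.msg, shared.len);
    size_t len = shared.len;
    shared.has_data = 0;
    pthread_mutex_unlock(&shared.lock);
    return len;
}
```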
Connecting to the transmitter's Wi-Fi, I ran the software. The transmitter was attached to an RX0 compact camera filming a stopwatch on a phone for a rough latency test. The path of the displayed image was: phone screen → RX0 filming the screen and outputting HDMI → HDMI into the transmitter → computer receiving wirelessly and displaying. End-to-end latency was roughly around 200ms.
Receiving Solution 2 - Development Board#
With the program running on a computer, porting it to embedded hardware looked feasible. I already had a MangoPi MQ-Pro based on the Allwinner D1, which has HDMI output, an H.264 hardware decoder, and a complete SDK with documentation, covering most of what a receiver needs. The downside is that its Wi-Fi module only supports 2.4GHz, so I replaced it with an RTL8821CS module that supports the 5GHz band and compiled the corresponding driver in order to connect to Yingmou's hotspot.
Allwinner provides the Tina Linux SDK for the D1. Tina's selling point is a Linux kernel combined with an OpenWRT-based userspace, aimed at making AIoT products, primarily smart speakers, more lightweight; OpenWRT is, after all, best known for running on routers with very limited memory and storage. The marketing claims that a system which previously needed 1GB DDR + 8GB eMMC can run on just 64MB DDR + 128MB NAND Flash with Tina Linux.
The D1 chip has an H.264 hardware decoder, and Tina exposes libcedar through an OpenMAX interface, so GStreamer's `omxh264dec` plugin can use libcedar for hardware video decoding. Tina also provides the `sunxifbsink` plugin, which uses the DE to perform the YV12 → RGB conversion. GStreamer was therefore the natural choice for decoding and display. After configuring the SDK according to this article and resolving various compilation issues, I obtained a GStreamer build with these plugins, and application development could begin.
Although I had not yet thought of Solution 2 while working on Solution 1 and used FFmpeg there, the TCP control commands and UDP data acquisition code can be reused. The core of GStreamer is a pipeline formed by a sequence of elements. To feed the frame data obtained over UDP into the pipeline, GStreamer's `appsrc` can be used: it provides an API for injecting data into a GStreamer pipeline. `appsrc` has two modes, push and pull. In pull mode, `appsrc` fetches data from the application through a specified interface whenever it needs more; in push mode, the application actively pushes data into the pipeline. With the push approach, we can actively "feed" data into `appsrc` from the UDP receiving thread. The pipeline is therefore created as follows:
- Create the elements:

  ```c
  appsrc  = gst_element_factory_make("appsrc", "source");
  parse   = gst_element_factory_make("h264parse", "parse");
  decoder = gst_element_factory_make("omxh264dec", "decoder");
  sink    = gst_element_factory_make("sunxifbsink", "videosink");
  ```

  Each element's properties are set with `g_object_set()`. The `caps` describe the format and properties of the data stream, so that elements can process it correctly and negotiate with one another. In this application the `caps` on `appsrc` matter most; without them, downstream elements would not know what format the incoming data is in. The `caps` for `appsrc` are configured as follows:

  ```c
  GstCaps *caps = gst_caps_new_simple("video/x-h264",
      "width", G_TYPE_INT, 1920,
      "height", G_TYPE_INT, 1080,
      "framerate", GST_TYPE_FRACTION, 30, 1,
      "alignment", G_TYPE_STRING, "nal",
      "stream-format", G_TYPE_STRING, "byte-stream",
      NULL);
  g_object_set(appsrc, "caps", caps, NULL);
  ```

- Create the pipeline, then add and link the elements:

  ```c
  pipeline = gst_pipeline_new("test-pipeline");
  gst_bin_add_many(GST_BIN(pipeline), appsrc, parse, decoder, sink, NULL);
  gst_element_link_many(appsrc, parse, decoder, sink, NULL);
  ```

  This forms the pipeline `appsrc → h264parse → omxh264dec → sunxifbsink`.
In the UDP thread we still loop over receives, collect all the segments belonging to each `msg_id`, and combine them into a complete message; the content is placed into a `GstBuffer` and pushed into `appsrc` with `g_signal_emit_by_name(appsrc, "push-buffer", gst_buffer, &ret)`. Besides the frame data itself, the buffer carries important timing parameters: `dts`, `pts`, and `duration`. With the `do-timestamp` property of `appsrc` set to `TRUE`, `appsrc` automatically timestamps each buffer as it arrives, but `duration` must still be set according to the frame rate. Without it, playback had an indescribable "stuttering" feel, most likely because the missing `duration` made the playback speed unstable. Although setting it may introduce a little extra latency, it is worth doing for a good viewing experience.
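A minimal sketch of that push path, assuming 30 fps and `do-timestamp` already set to `TRUE` on `appsrc` (the helper name is illustrative, not from the actual code):

```c
#include <gst/gst.h>
#include <string.h>

/* Called from the UDP thread for every reassembled Message.
 * PTS/DTS are stamped on arrival by appsrc (do-timestamp=TRUE);
 * duration is set here from the nominal 30 fps frame rate. */
static void push_message(GstElement *appsrc, const guint8 *msg, gsize len)
{
    GstBuffer *buf = gst_buffer_new_allocate(NULL, len, NULL);
    gst_buffer_fill(buf, 0, msg, len);

    GST_BUFFER_DURATION(buf) = gst_util_uint64_scale(GST_SECOND, 1, 30);

    GstFlowReturn ret;
    g_signal_emit_by_name(appsrc, "push-buffer", buf, &ret);
    gst_buffer_unref(buf);  /* the push-buffer action signal takes its own ref */
}
```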
To compile the completed code, a Makefile can be written so that our code is packaged as an OpenWRT software package, compiled together when building the rootfs.
Running the software on the board, I again connected the transmitter to the RX0 camera filming the stopwatch on a screen for a rough latency test. The path was: left screen → RX0 filming the screen and outputting HDMI → HDMI into the transmitter → development board receiving wirelessly and outputting HDMI → HDMI into the right display. End-to-end latency was roughly 200-300ms, which is not low. On the positive side, video played on the screen and watched through the transmission link was relatively smooth.
Testing the camera directly connected to the display, the process was left screen display → RX0 filming the screen and outputting via HDMI → HDMI input to the right display, with latency around 70ms. Therefore, the latency of the transmission itself is roughly between 130ms-230ms.
With this receiving end, it is possible to connect to various monitors via HDMI, allowing Yingmou to not be limited to using smartphones and tablets for monitoring.
Receiving Solution 3 - OBS Studio Plugin#
The first two receiving solutions allow monitoring on a computer or an HDMI display when using Yingmou, but they still do not meet the need for low-latency live streaming. If the receiving program from Solution 1 is modified slightly to send the H.264 stream to OBS Studio over UDP on localhost, it turns out that enabling buffering gives smooth playback but high latency, while disabling buffering gives low latency but frequent stuttering never seen while monitoring; Solution 2 could feed a capture card from its HDMI output, but that adds the latency of decoding and output on the development board plus the capture card itself. To reduce latency, writing an OBS plugin directly is just about the best option.
OBS Studio supports extending its functionality through plugins (see "Plugins" in the OBS Studio 30.0.0 documentation, obsproject.com). According to that introduction, a Source-type plugin can feed video sources into OBS. OBS Source plugins come in two flavors: synchronous and asynchronous video sources. Synchronous sources, like Image Source, run in step with OBS's rendering loop, with OBS actively calling the source's render function to obtain frame data, which suits graphics rendering or effects processing. Asynchronous sources can run in their own worker threads, independent of OBS's rendering loop, and actively push frame data to OBS, which is a better fit for network streams and camera inputs.
Based on the provided plugin template obs-plugintemplate, I set up the project and environment and completed the logic by referring to the source of OBS's existing image_source plugin. Very little has to change, as most of the code from Solution 2 can be reused. The difference is that the decoded frames can no longer be handed straight to a display element; they have to be pulled out through `appsink` and passed to OBS Studio by calling `obs_source_output_video()`.
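That hand-off can be sketched as an `appsink` "new-sample" callback (registered via `gst_app_sink_set_callbacks()`) that maps the decoded buffer and forwards it with `obs_source_output_video()`. This is a simplified sketch assuming tightly packed I420 output at 1920×1080; a real plugin should take the dimensions and strides from the negotiated caps and `GstVideoMeta`:

```c
#include <gst/gst.h>
#include <gst/app/gstappsink.h>
#include <obs-module.h>

/* appsink "new_sample" callback: map the decoded frame and hand it
 * to OBS as an asynchronous video frame. `user_data` is assumed to be
 * the obs_source_t created by the plugin. */
static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer user_data)
{
    obs_source_t *source = user_data;
    GstSample *sample = gst_app_sink_pull_sample(sink);
    GstBuffer *buf = gst_sample_get_buffer(sample);

    GstMapInfo map;
    gst_buffer_map(buf, &map, GST_MAP_READ);

    struct obs_source_frame frame = {0};
    frame.width     = 1920;                  /* from negotiated caps */
    frame.height    = 1080;
    frame.format    = VIDEO_FORMAT_I420;
    frame.timestamp = GST_BUFFER_PTS(buf);
    frame.data[0] = map.data;                         /* Y plane */
    frame.data[1] = map.data + 1920 * 1080;           /* U plane */
    frame.data[2] = map.data + 1920 * 1080 * 5 / 4;   /* V plane */
    frame.linesize[0] = 1920;
    frame.linesize[1] = 960;
    frame.linesize[2] = 960;

    obs_source_output_video(source, &frame);

    gst_buffer_unmap(buf, &map);
    gst_sample_unref(sample);
    return GST_FLOW_OK;
}
```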
After a successful build, the `.so` file from the `build` directory is copied into the OBS Studio plugin directory (e.g., `/usr/local/lib/obs-plugins/`), and OBS Studio can be started for testing. The same latency test was run, with the path: left screen → RX0 filming the screen and outputting HDMI → HDMI into the transmitter → computer OBS plugin receiving wirelessly and displaying. End-to-end latency was again roughly around 200ms, and video played on the screen and watched through the transmission link remained smooth and coherent.
OBS Studio plugin latency test: left screen display → RX0 filming the screen and outputting via HDMI → HDMI input to transmission → computer OBS plugin wirelessly receiving and displaying.
Connecting all three solutions simultaneously, an increase in stuttering can be perceived, but all still maintain a reasonably acceptable latency.
Conclusion#
To some extent, with powerful codecs and mature wireless technology behind it, real-time video transmission is not that hard to achieve, and the software logic can be simple and direct. The widespread use of Hisilicon chips with hardware encoders in surveillance and the popularity of 5GHz Wi-Fi have almost incidentally allowed products like Wi-Fi video transmitters to deliver a good real-time transmission experience at extremely low cost. Coupled with the boom in devices with large screens, the barrier is lowered further. Unfortunately, the ceiling is also constrained: enjoying the Wi-Fi ecosystem means enduring Wi-Fi congestion, and enjoying mature hardware codecs generally means giving up the freedom to modify them. None of this stops Yingmou from being a rather complete product. It does its intended job well, has no serious shortcomings, and despite the arrival of many newer products it still covers basic transmission needs perfectly well. Zhixun once described how Yingmou was made in a live broadcast; their early, accurate grasp of user pain points and the care they put into a polished product are still commendable.