Polyphonic pickup, audio device

repository: polypu

Although the ability to make various one-shot measurements is necessary to evaluate performance, it is secondary to the ultimate function of the pickup, which is to acquire and emit audio data in real time. Up until now, the device was configured to act as a CDC (communication device class) device, and while that does give us the ability to send and receive arbitrary data, it is not well suited for streaming low-latency, high-bandwidth audio, especially considering that the payload is carried over bulk rather than isochronous transfers. Although I have experimented with recording a single channel over CDC with a companion C program before, there is another device class that is more appropriate for this purpose.

The device class that defines the protocols for transferring audio data over USB is most commonly called UAC (USB audio class), although the specification refers to this class of devices as ADC (audio device class). Endpoints used for streaming audio use isochronous transfers, which are intended for low-latency, high-bandwidth communication. In our case, this communication must occur in parallel with the CDC traffic described before. While it would be possible to configure the pickup to act as a CDC and UAC device separately, and to switch between the two based on an external action (such as a button press), we thankfully don’t have to.

To use both of these modes of communication simultaneously, we can configure multiple interfaces for our device, making it a composite one. If our hardware supports enough endpoints, we can expose every set of them that each device class requires. To start with, we’ll have to demonstrate the capacity to function as either of these device classes. After that, we can figure out how to integrate them into a single configuration.
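As a taste of what this means in practice, tinyusb gates each class behind a configuration macro, so a composite device simply enables more than one of them. The option names below are tinyusb’s own, but everything else (endpoint sizes, buffer sizes, and so on) is elided here.

// Hypothetical minimal slice of tusb_config.h for a composite device
#define CFG_TUD_CDC    1 // enable a CDC (virtual serial) interface
#define CFG_TUD_AUDIO  1 // enable a UAC (audio) interface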

Searching for signs of life

After realizing the difficulty of integrating a new device type into an existing configuration, it seemed like I should start by getting some example code to compile in a fresh project, seeing as there was an example called usb_device_uac in the esp-iot-solution repo. Shortly after, the output of lsusb showed that the device managed to enumerate, and arecord -l also listed it as an audio device. The following command helped with recording the output from the capture device.

$ arecord -D hw:4,0 -f S16_LE -r 48000 -c 1 -d 2 ~/test.wav

The above recorded from the zeroth subdevice of the fourth hardware device, with a sample format of signed 16-bit little-endian, a sampling rate of 48 kHz, from a single channel, for a duration of two seconds. If we need conversions, such as resampling or channel and format conversions, it’s also possible to specify plughw instead of hw. Those details are known here, though, so we can also specify them exactly.
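For example, a capture resampled to 44.1 kHz through the plugin layer would look something like this (with the device numbers and output path kept the same as above):

$ arecord -D plughw:4,0 -f S16_LE -r 44100 -c 1 -d 2 ~/test.wav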

However, trying to output a saw wave resulted in a jagged mess, with the data resembling portions of a saw wave at the level of blocks, but tearing randomly at block boundaries. Taking a look at the component’s implementation, something seemed off almost immediately. Here’s the loop executed by the microphone task.

while (1) {
    if (s_uac_device->mic_active == false) {
        // clear the notification
        ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
        continue;
    }
    // read data from the microphone chunk by chunk
    size_t bytes_require =
        MIC_INTERVAL_MS * s_uac_device->mic_bytes_per_ms;
    if (s_uac_device->user_cfg.input_cb) {
        size_t bytes_read = 0;
        esp_err_t ret = s_uac_device->user_cfg.input_cb(
            (uint8_t*)s_uac_device->mic_buf_write, bytes_require,
            &bytes_read, s_uac_device->user_cfg.cb_ctx);
        if (ret != ESP_OK) {
            ESP_LOGE(TAG, "Failed to read data from mic");
            continue;
        }
        int16_t* tmp_buf = s_uac_device->mic_buf_write;
        UAC_ENTER_CRITICAL();
        s_uac_device->mic_buf_write = s_uac_device->mic_buf_read;
        s_uac_device->mic_buf_read = tmp_buf;
        s_uac_device->mic_data_size = bytes_read;
        UAC_EXIT_CRITICAL();
    }
}

It seems like it’s just… repeating as fast as it possibly can, invoking our input callback, taking data from it as fast as it can make it. After the callback returns, the pointers of a double buffer are swapped inside a critical section, and that’s it. The place that takes from that double buffer is a pre-load callback of tinyusb called tud_audio_tx_done_pre_load_cb, which is executed at every block interval, meaning 1 kHz for a 1 ms interval.

Apparently, the component expects the input callback to block for an exact amount of time, because adding a busy-wait in the callback seemed to produce an almost continuous output waveform. Not quite, though, as it still seemed to tear every few seconds. Blocking for an exact amount of time is a really weird expectation to put on a callback, and I’m not sure why it was done here. As an experiment, I tried moving the invocation of the callback and the buffer swap into the tinyusb callback, and that immediately yielded a continuous output waveform.
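The shape of that experiment was roughly the following, with the double buffer elided for brevity, and the component’s internal names (s_uac_device, mic_bytes_per_ms, and so on) assumed from the loop above; a sketch, not the component’s actual code.

// Produce the block directly in the tinyusb pre-load callback, so the
// pacing is dictated by the consumer instead of a free-running task
bool tud_audio_tx_done_pre_load_cb(uint8_t rhport, uint8_t itf,
                                   uint8_t ep_in, uint8_t cur_alt_setting)
{
    (void)rhport; (void)itf; (void)ep_in; (void)cur_alt_setting;

    size_t bytes_require =
        MIC_INTERVAL_MS * s_uac_device->mic_bytes_per_ms;
    size_t bytes_read = 0;

    if (s_uac_device->user_cfg.input_cb &&
        s_uac_device->user_cfg.input_cb(
            (uint8_t*)s_uac_device->mic_buf_write, bytes_require,
            &bytes_read, s_uac_device->user_cfg.cb_ctx) == ESP_OK) {
        // Feed the freshly produced block straight to the output fifo
        tud_audio_write((uint8_t*)s_uac_device->mic_buf_write,
                        (uint16_t)bytes_read);
    }
    return true;
}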

After this experimental fix, and after processing the… questionable functionality of the component, I tried figuring out the maximum amount of data I could push to the host with it. The ESP32-S2/S3/C3 chips come with a USB-OTG peripheral, supporting full-speed USB at 12 Mbps. After reading rumours that the tinyusb stack on the ESP32 can only utilize around 6.1 Mbps of that bandwidth, I was surprised to see that it could seemingly output not just 6.144 Mbps (8 channels at 48 kHz, 16 bit), but even 7.168 Mbps (8 channels at 56 kHz, 16 bit) of payload data. I haven’t tested these too thoroughly though, so take the numbers with a grain of salt. Either way, these results sound hopeful if we consider that most guitars would only need 4.608 Mbps to output six channels at the same sampling rate and format.

An extra tidbit of terror is that going beyond four channels was restricted in the component by a Kconfig option, for reasons currently beyond my understanding. Poring over the code, I could find no limitation that would constrain the channel count of the microphone to less than eight.

Composite integration

With the ability to output audio having been demonstrated to some degree, it was time to try and integrate it with CDC. However, there was a mismatch: the latter had previously been configured with the convenience functions of esp-idf, such as tinyusb_driver_install and tusb_cdc_acm_init, while the UAC example configured tinyusb much more directly, defining a heap of macros, descriptors, and callback functions for it. Luckily, tinyusb includes an example of a composite CDC/UAC device because, as it turns out, this composition is quite common in homebrew radio hacking, and they maintain the example with the aim of providing an entry point for those starting out in that space.

Comparing the UAC example with the composite example made it possible to figure out how to implement a composite device with tinyusb in general, and how to build it as a component of esp-idf specifically. I found the organization of the UAC example a bit hard to grasp though, so I tried laying things out a little differently in my component.

.
├── CMakeLists.txt        -- build system integration
├── idf_component.yml     -- component dependencies
├── include
│   ├── usb_cdc.h         -- CDC callback signatures
│   ├── usb_cdc_uac.h     -- composite init declaration
│   └── usb_uac.h         -- mic-related UAC config macros
├── src
│   ├── usb_cdc.c         -- CDC callback implementations
│   ├── usb_cdc_uac.c     -- composite initialization
│   └── usb_uac.c         -- UAC callback implementations
└── tusb
    ├── tusb_config_cdc.h -- generic CDC config macros
    ├── tusb_config.h     -- common & board-specific macros
    ├── tusb_config_uac.h -- generic UAC config macros
    └── tusb_desc.c       -- device, config & string descriptors

Integration happened in two steps: first regaining CDC, then grafting UAC onto it. The former let me deal with board-specific things first, and was easier to verify, which in turn let me concentrate on the caprices of the less trivial device class. There was also a necessary detour during which I tried to excise any speaker-related definitions from the UAC example, to leave only those relevant to a microphone.
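To give an idea of what the grafting amounts to in tusb_desc.c, the configuration descriptor essentially concatenates the descriptor macros of both classes. The sketch below uses tinyusb’s stock macros, with endpoint numbers, sizes, and string indices chosen arbitrarily; the actual descriptor differs.

#include "tusb.h"

// Interface numbers: CDC occupies two, the microphone another two
enum {
    ITF_NUM_CDC = 0,
    ITF_NUM_CDC_DATA,
    ITF_NUM_AUDIO_CONTROL,
    ITF_NUM_AUDIO_STREAMING,
    ITF_NUM_TOTAL
};

#define CONFIG_TOTAL_LEN (TUD_CONFIG_DESC_LEN + TUD_CDC_DESC_LEN + \
                          TUD_AUDIO_MIC_FOUR_CH_DESC_LEN)

uint8_t const desc_configuration[] = {
    // Config: number, interface count, string index, total length,
    // attributes, power in mA
    TUD_CONFIG_DESCRIPTOR(1, ITF_NUM_TOTAL, 0, CONFIG_TOTAL_LEN, 0, 100),

    // CDC: interface number, string index, notification EP & its size,
    // data EP out, data EP in, data EP size
    TUD_CDC_DESCRIPTOR(ITF_NUM_CDC, 4, 0x81, 8, 0x02, 0x82, 64),

    // UAC microphone: interface number, string index, bytes per sample,
    // bits used per sample, streaming EP in & its size
    // (4 ch * 48 samples/ms * 2 bytes = 384 bytes per 1 ms frame)
    TUD_AUDIO_MIC_FOUR_CH_DESCRIPTOR(ITF_NUM_AUDIO_CONTROL, 5, 2, 16,
                                     0x83, 384),
};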

Shedding some complexity

One more thing that left me feeling uncertain was the mechanism for emitting the output audio. As I mentioned above, I was still using the pre-load callback to invoke an input callback, and to put the block of samples it wrote into the output fifo of tinyusb. The more I thought about scheduling and synchronization, the more untenable this scheme started to seem.

In this approach, the callback starts to be invoked at a fairly constant frequency once recording starts. I say fairly constant because, while toggling some GPIOs to understand timings, I noticed that the first two callbacks often ran noticeably less than a millisecond apart. Other than that, it should be safe to assume that, in a steady state, the tinyusb callback will consume data at a consistent rate.

That data has to be buffered somehow, as it is produced by code that executes independently of the pre-load callback. This is a classical synchronization problem, where a single writer produces data and a single reader consumes it. If we assume that both of them do so at the same average data rate, we can use a bounded buffer to store the data that passes between them. The problems here are underruns and overruns caused by jitter (slightly inexact data rates on the scale of short intervals of time), and latency (the time that passes between the data entering and leaving the buffer).
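As a concrete illustration of such a bounded buffer (and not the code the project ended up with), a single-producer single-consumer ring of sample blocks can be as small as the following, assuming C11 atomics and 48-sample blocks.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPA  8  // slot count; must be a power of two
#define BLOCK_LEN 48  // one 1 ms block at 48 kHz, single channel

typedef struct {
    int16_t blocks[RING_CAPA][BLOCK_LEN];
    _Atomic size_t head; // written only by the producer
    _Atomic size_t tail; // written only by the consumer
} spsc_ring_t;

// Producer side; returns false on overrun (buffer full)
static bool ring_push(spsc_ring_t* r, const int16_t* block)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPA) { return false; }
    for (size_t i = 0; i < BLOCK_LEN; ++i) {
        r->blocks[head % RING_CAPA][i] = block[i];
    }
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

// Consumer side; returns false on underrun (buffer empty)
static bool ring_pop(spsc_ring_t* r, int16_t* block)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) { return false; }
    for (size_t i = 0; i < BLOCK_LEN; ++i) {
        block[i] = r->blocks[tail % RING_CAPA][i];
    }
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}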

Thinking about jitter and how it influences various flows of execution made it seem like there’s a rat’s nest of potential problems that could impact the correctness of the mechanism. Especially once we consider synchronization, and all the different ways in which it could be implemented to match the phase of the input and output sides. All that while contemplating the latency that each method would add, and trying to figure out how resistant each would be to underruns and overruns.

After staring into this abyss of complexity for a few days, it occurred to me that some of it could be self-inflicted. I kept trying to figure out how to refine the performance of the approach based on the pre-load callback, when that was something the code had inherited from the example component; I didn’t come up with it myself. After all, if we have access to the output fifo of tinyusb in the context where the output data is produced, what’s the point of adding an extra layer of buffering just so that data could enter that fifo in the pre-load callback? That approach seems to just add an extra block or two of latency, as well as more points of failure that can make the output stream lose continuity. Passing to the output fifo in the producing context seems to work so far, and I guess we’ll see how it holds up as it sees more use.

while (true) {

    if (!ulTaskNotifyTake(pdTRUE, portMAX_DELAY)) { continue; }

    if (shell.state_lock) { continue; }

    // Pass sample block to the shell
    shell_recv_block(&shell, &state->buffer.blocks[BUFFER_INDEX],
                     &state->output);

    // Linearize block of output data into the expected format
    block_linearize(&state->output, state->linear);

    // Increment buffer index circularly
    BUFFER_INDEX = (BUFFER_INDEX + 1) % state->buffer.capa;

    // If the pre-load callback ran too long ago (10 ms at a 160 MHz clock)
    esp_cpu_cycle_count_t now = esp_cpu_get_cycle_count();
    if (now - PRE_LOAD_LAST > 160 * 1000 * 10) {

        // Set tracker envelopes to start closing
        for (int i = 0; i < state->params.channels; ++i) {
            envelope_set_dir(&state->trackers[i].env, -1);
        }
    }

    // Attempt to get fifo for the in endpoint
    tu_fifo_t* sw_in_fifo = tud_audio_get_ep_in_ff();
    if (!sw_in_fifo) { continue; }

    // If the tinyusb fifo can't take the output data, skip
    uint16_t need =
        state->output.capa * state->output.chan * sizeof(int16_t);
    if (tu_fifo_remaining(sw_in_fifo) < need) { continue; }

    // Write output audio data
    tud_audio_write(state->linear, need);
}

I’ve included the loop executed by the producing RTOS task above, even if it’s imperfect, and incomplete without context, just to show the rough idea and some of the specific methods it uses. As you can see at the top, it uses the task notification mechanism of FreeRTOS, which - after a bit of testing - seems to be capable of resuming a task at a precision that is independent of scheduler ticks.
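For completeness, the producing side of such a notification could look something like this in esp-idf; the task handle and the interrupt source here are assumptions, not the project’s code.

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_attr.h"

static TaskHandle_t s_producer_task; // stored when the task is created

// Hypothetical ISR that fires once a block of samples is ready
static void IRAM_ATTR block_ready_isr(void* arg)
{
    (void)arg;
    BaseType_t higher_prio_woken = pdFALSE;

    // Increment the notification value that ulTaskNotifyTake consumes
    vTaskNotifyGiveFromISR(s_producer_task, &higher_prio_woken);

    // Request a context switch on ISR exit if a higher-priority task woke
    portYIELD_FROM_ISR(higher_prio_woken);
}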

There are many other ways of synchronizing sampling and sample processing, and this is just one of them. Synchronizing between different flows on one or more cores could rely on platform-specific mechanisms, or it could be made independent of them. It could use the tools provided by an RTOS, or it could avoid those explicitly. It would even be possible to sacrifice two GPIO pins for this purpose in some applications, inducing an edge transition on an output pin, and setting an interrupt to occur on that edge on the input one. This is just to say that there are usually a variety of tools accessible on each platform, and it’s worth experimenting with them to figure out the most appropriate one for each situation.
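As an illustration of that last variant, the GPIO loopback could be set up like this with the esp-idf driver API, with arbitrary pin numbers and the two pins wired together on the board; again a sketch, not something the project uses.

#include <stdbool.h>
#include "driver/gpio.h"
#include "esp_attr.h"

#define PIN_SYNC_OUT GPIO_NUM_17 // arbitrary; wired to PIN_SYNC_IN
#define PIN_SYNC_IN  GPIO_NUM_18

static volatile bool s_block_ready = false;

static void IRAM_ATTR sync_edge_isr(void* arg)
{
    (void)arg;
    s_block_ready = true; // or wake a task, as above
}

static void sync_pins_init(void)
{
    gpio_set_direction(PIN_SYNC_OUT, GPIO_MODE_OUTPUT);
    gpio_set_direction(PIN_SYNC_IN, GPIO_MODE_INPUT);
    gpio_set_intr_type(PIN_SYNC_IN, GPIO_INTR_POSEDGE);
    gpio_install_isr_service(0);
    gpio_isr_handler_add(PIN_SYNC_IN, sync_edge_isr, NULL);
}

// Called from the producing context to induce the rising edge
static void sync_signal(void)
{
    gpio_set_level(PIN_SYNC_OUT, 1);
    gpio_set_level(PIN_SYNC_OUT, 0);
}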

Also, the envelope above refers to a delayed up-down ramp, which is meant to prevent popping artifacts at the start of recording. Tearing is something we have to be less worried about if we trust tinyusb to empty the output fifo once recording stops (leaving no stale data in the fifo), but popping could still occur if a channel is at a large negative or positive swing at the start of the first block that reaches the host. I’m not sure if there are mechanisms at other stages that could be relied upon to prevent this; I just added an envelope to deal with it because it wasn’t too difficult to do so.
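To make the idea concrete, the general shape of such a ramp is sketched below; the project’s envelope_set_dir interface only appears in passing above, so this is a guess at the concept rather than its actual implementation.

#include <stdint.h>

typedef struct {
    float level; // current gain in [0, 1]
    float step;  // per-sample increment
    int   dir;   // +1 while opening, -1 while closing
} ramp_env_t;

// Scale one sample by the envelope, moving the gain toward its target
static inline int16_t ramp_apply(ramp_env_t* env, int16_t sample)
{
    env->level += (float)env->dir * env->step;
    if (env->level > 1.0f) { env->level = 1.0f; }
    if (env->level < 0.0f) { env->level = 0.0f; }
    return (int16_t)((float)sample * env->level);
}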
