Guide To Differentiable Digital Signal Processing (DDSP) Library with Python Code

Differentiable Digital Signal Processing (DDSP) is an audio generation library that combines classical interpretable DSP elements (like oscillators, filters, synthesizers) with deep learning models. It was introduced by Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu and Adam Roberts (ICLR paper). 

Before going into the library’s details, let us have an overview of the concept of DSP.

What is DSP?

Digital Signal Processing (DSP) is a process in which digitized signals such as audio, video, pressure, temperature etc. are taken as input and mathematical operations are performed on them, e.g. adding, subtracting or multiplying the signals. Visit this page for a detailed understanding of DSP.
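Since a digitized signal is just an array of samples, these operations reduce to element-wise arithmetic on arrays. As a minimal illustration (not part of the original article), the NumPy sketch below mixes two sampled sine waves by adding them and scales the result:

import numpy as np

sample_rate = 16000                       # samples per second
t = np.arange(sample_rate) / sample_rate  # time stamps for 1 second of audio

tone_a = np.sin(2 * np.pi * 440.0 * t)    # 440 Hz sine wave
tone_b = np.sin(2 * np.pi * 660.0 * t)    # 660 Hz sine wave

mixed = tone_a + tone_b                   # adding two signals mixes them
quieter = 0.5 * mixed                     # multiplying by a constant scales the loudness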

Overview of DDSP

The DDSP library creates complex, realistic audio signals by controlling the parameters of simple, interpretable DSP components. For example, by tuning the frequencies and responses of sinusoidal oscillators and linear filters, it can synthesize the sound of a realistic instrument such as a violin, flute etc. 
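As a rough sketch of what “controlling the parameters of a simple DSP component” means, the snippet below drives a bank of two sinusoidal oscillators with per-sample frequency and amplitude envelopes. It assumes the ddsp.core.oscillator_bank helper behaves as in the official DDSP examples; treat it as an illustration rather than code from this article.

import numpy as np
import ddsp

sample_rate = 16000
n_samples = sample_rate  # 1 second of audio

# Frequency and amplitude envelopes of shape [batch, n_samples, n_sinusoids]
frequencies = np.zeros([1, n_samples, 2], dtype=np.float32)
frequencies[..., 0] = 440.0                                  # first oscillator held at 440 Hz
frequencies[..., 1] = np.linspace(220.0, 880.0, n_samples)   # second oscillator sweeps upward
amplitudes = 0.1 * np.ones([1, n_samples, 2], dtype=np.float32)

# Sum the two sinusoids into a single waveform of shape [1, n_samples]
audio = ddsp.core.oscillator_bank(frequencies, amplitudes, sample_rate=sample_rate)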



How does DDSP work?

Working of DDSP

Image source: Official documentation

Neural network models such as WaveNet, which are used for audio generation, produce waveforms one sample at a time. Unlike these models, DDSP passes parameters through known algorithms for sound synthesis. Since all the components in the above figure are differentiable, the model can be trained end-to-end using stochastic gradient descent and backpropagation.
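To make the “differentiable” part concrete, the sketch below renders audio from interpretable parameters with a harmonic synthesizer and backpropagates a spectral loss into those parameters using tf.GradientTape. It is only a toy illustration of the idea, assuming ddsp.synths.Harmonic and ddsp.losses.SpectralLoss behave as in the official DDSP tutorials; it is not the library’s actual training code.

import tensorflow as tf
import ddsp

n_frames, n_samples, sample_rate = 250, 16000, 16000
synth = ddsp.synths.Harmonic(n_samples=n_samples, sample_rate=sample_rate)

# Interpretable synthesis parameters, treated here as trainable variables
amps = tf.Variable(tf.ones([1, n_frames, 1]) * 0.5)              # overall amplitude per frame
harmonic_distribution = tf.Variable(tf.ones([1, n_frames, 30]))  # relative strength of 30 harmonics
f0_hz = tf.ones([1, n_frames, 1]) * 440.0                        # fundamental frequency per frame

# A fixed "target" rendered with different parameters, just so the loss has something to match
target = synth(tf.ones([1, n_frames, 1]) * 0.1, tf.ones([1, n_frames, 30]), f0_hz)

spectral_loss = ddsp.losses.SpectralLoss()
with tf.GradientTape() as tape:
  audio = synth(amps, harmonic_distribution, f0_hz)   # differentiable synthesis
  loss = spectral_loss(target, audio)                 # differentiable loss on the waveforms

# Non-zero gradients w.r.t. the synthesis parameters are what enable end-to-end SGD training
grads = tape.gradient(loss, [amps, harmonic_distribution])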

Practical implementation of DDSP

DDSP practical implementation

Image source: Official Colab Demo

Here’s a demonstration of timbre (pitch/tone quality) transfer using DDSP. The code has been implemented in Google Colab with Python 3.7.10 and ddsp 1.2.0. A step-wise explanation of the code follows:

  1. Install DDSP library

!pip install ddsp

  1. Import required libraries and modules.
import warnings
warnings.filterwarnings("ignore")
import copy
import os  #for interacting with the operating system
import time
import crepe
import ddsp
import ddsp.training
from ddsp.colab import colab_utils
from ddsp.colab.colab_utils import (
    auto_tune, detect_notes, fit_quantile_transform,
    get_tuning_factor, download, play, record,
    specplot, upload, DEFAULT_SAMPLE_RATE)
import gin
from google.colab import files
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pickle
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
  1. Initialize the signal sampling rate (the default sampling rate of 16000 defined in ddsp.spectral_ops has been used here)

sample_rate = DEFAULT_SAMPLE_RATE   

  1. Display options for the user to record an input audio signal or upload one. If recorded, provide an option to select the number of seconds for which recording is to be done.
#Allow .mp3 or .wav file extensions for the uploaded file
record_or_upload = "Upload (.mp3 or .wav)"  #@param ["Record", "Upload (.mp3 or .wav)"]
"""
The recording duration can range from 1 to 10 seconds; it can be modified in steps of 1 second
"""
record_seconds = 20 #@param {type:"number", min:1, max:10, step:1}
  1. Define the actions to be performed based on the user’s choice of recording or uploading the audio.
#If the user selects the 'Record' option, record audio from the browser using the record() method defined here
if record_or_upload == "Record":
  audio = record(seconds=record_seconds)
#If the user selects the 'Upload' option, load a .wav or .mp3 audio file from disk into the colab notebook using the upload() method defined here
else:
  filenames, audios = upload()
  #upload() returns the names of the uploaded files and their audio; if the user uploads multiple files, select the first one from the 'audios' array
  audio = audios[0]
audio = audio[np.newaxis, :]
print('\nExtracting audio features...')
  1. Plot the spectrogram of the audio signal using the specplot() method

specplot(audio)

Create an HTML5 audio widget using the play() method to play the audio file

play(audio)

Reset CREPE’s global state to re-build the model

ddsp.spectral_ops.reset_crepe()

  1. Record the start time

start_time = time.time()

Compute the audio features

audio_features = ddsp.training.metrics.compute_audio_features(audio)

Cast the loudness (in decibels) of the audio to float32

audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
audio_features_mod = None

Compute the time taken to calculate the audio features by subtracting the start time from the current time

print('Audio features took %.1f seconds' % (time.time() - start_time))
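compute_audio_features() returns a dictionary of frame-wise features; the keys used later in this demo are 'audio', 'f0_hz', 'f0_confidence' and 'loudness_db'. A quick sanity check (a small addition, not in the original demo) is to print each key with its array shape:

for key, value in audio_features.items():
  print(key, np.asarray(value).shape)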

  1. Plot the computed features
#Number of frames to trim from the end of the feature arrays when plotting
TRIM = -15
fig, ax = plt.subplots(nrows=3,
                       ncols=1,
                       sharex=True,
                       figsize=(6, 8))
#Plot the loudness of the audio
ax[0].plot(audio_features['loudness_db'][:TRIM])
ax[0].set_ylabel('loudness_db')
#Plot the fundamental frequency as MIDI notes
ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax[1].set_ylabel('f0 [midi]')
#Plot the confidence of the pitch estimate
ax[2].plot(audio_features['f0_confidence'][:TRIM])
ax[2].set_ylabel('f0 confidence')
_ = ax[2].set_xlabel('Time step [frame]')

Output:

[Output plots: loudness_db, f0 (MIDI) and f0 confidence vs. time step]

The .mp3 audio file that we have used for the demonstration:

(Source of the audio file)

  1. Select the pretrained instrument model to be used.
 mannequin="Violin" #@param ['Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone', 'Upload your own (checkpoint folder as .zip)']
 MODEL = mannequin 

Define a function to find the directory of the selected model

def find_model_dir(dir_name):
  # Iterate through the directories until the model directory (containing a .gin config) is found
  for root, dirs, filenames in os.walk(dir_name):
    for filename in filenames:
      if filename.endswith(".gin") and not filename.startswith("."):
        model_dir = root
        break
  return model_dir
  1.  Download the selected pretrained model, or unzip a user-uploaded checkpoint.
if model in ('Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'):
  # Pretrained models.
  PRETRAINED_DIR = '/content/pretrained'
  # Copy over from gs:// for faster loading.
  !rm -r $PRETRAINED_DIR &> /dev/null
  !mkdir $PRETRAINED_DIR &> /dev/null
  GCS_CKPT_DIR = 'gs://ddsp/models/timbre_transfer_colab/2021-01-06'
  model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
  !gsutil cp $model_dir/* $PRETRAINED_DIR &> /dev/null
  model_dir = PRETRAINED_DIR
  gin_file = os.path.join(model_dir, 'operative_config-0.gin')
else:
  # User models.
  UPLOAD_DIR = '/content/uploaded'
  !mkdir $UPLOAD_DIR
  uploaded_files = files.upload()
  for fnames in uploaded_files.keys():
    print("Unzipping... {}".format(fnames))
    !unzip -o "/content/$fnames" -d $UPLOAD_DIR &> /dev/null
  model_dir = find_model_dir(UPLOAD_DIR)
  gin_file = os.path.join(model_dir, 'operative_config-0.gin')
  1.  Load the dataset statistics file
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
  #Load the dataset statistics file if it exists
  if tf.io.gfile.exists(dataset_stats_file):
    with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
      DATASET_STATS = pickle.load(f)
#Print a message if loading the pickle file fails
except Exception as err:
  print('Loading dataset statistics from pickle failed: {}.'.format(err))
  1. Parse the gin config file
#First, unlock the config temporarily using a context manager
with gin.unlock_config():
  #Parse the file using parse_config_file()
  gin.parse_config_file(gin_file, skip_unknown=True)
  1. Store the checkpoint files
"""
For each file in the listing of the model directory, add it to 'ckpt_files' if it is a checkpoint file
"""
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
#Extract the name of the checkpoint file
ckpt_name = ckpt_files[0].split('.')[0]
#Join the checkpoint filename to the path of the model directory
ckpt = os.path.join(model_dir, ckpt_name)
  1. Check that the dimensions and sampling rates are equal
"""
gin.query_parameter() returns the value currently bound to the binding key specified as
its argument. The binding is the parameter whose value we need to query.
"""
#Time steps used during training
time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
#Number of audio samples used during training
n_samples_train = gin.query_parameter('Harmonic.n_samples')
#Compute the number of samples between successive frames (called the 'hop size')
hop_size = int(n_samples_train / time_steps_train)
#Compute the total time steps and number of samples for the input audio
time_steps = int(audio.shape[1] / hop_size)
n_samples = time_steps * hop_size
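As a worked example of the arithmetic above (the numbers are assumptions roughly matching the solo-instrument checkpoints, not values read from this run): with 1000 training time steps over 64000 training samples, the hop size is 64 samples, so a 5-second input at 16 kHz maps to 1250 frames and 80000 samples.

# Standalone arithmetic only, not meant to be pasted into the notebook
train_time_steps = 1000                                     # assumed value from the gin config
train_n_samples = 64000                                     # assumed value from the gin config
example_hop_size = train_n_samples // train_time_steps      # 64 samples between successive frames

example_audio_len = 5 * 16000                               # a 5-second input at 16 kHz
example_time_steps = example_audio_len // example_hop_size  # 1250 frames
example_n_samples = example_time_steps * example_hop_size   # trimmed length: 80000 samples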
  1.  Create a list of gin parameters
gin_params = [
    'Harmonic.n_samples = {}'.format(n_samples),
    'FilteredNoise.n_samples = {}'.format(n_samples),
    'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
    'oscillator_bank.use_angular_cumsum = True',
]
 Parse the above gin parameters
#First, unlock the config
with gin.unlock_config():
  #Parse the list of parameter bindings using parse_config()
  gin.parse_config(gin_params)
  1. Trim the input vectors to the correct lengths
#Trim each of the frequency, confidence and loudness arrays to the time step length
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
  audio_features[key] = audio_features[key][:time_steps]
#Trim the 'audio' array to a length equal to the total number of samples
audio_features['audio'] = audio_features['audio'][:, :n_samples]
  1. Initialize the model just to predict audio

model = ddsp.training.models.Autoencoder()

Restore the model from the checkpoint

model.restore(ckpt)

  1. Build the model by running a batch of audio features through it.
#Record the start time
start_time = time.time()
#Build the model using the computed features
_ = model(audio_features, training=False)
"""
Display the time taken to restore the model by computing the difference between the current time and the start time
"""
print('Restoring model took %.1f seconds' % (time.time() - start_time))

Sample output: Restoring model took 2.0 seconds

  1. The pretrained models (Violin, Flute etc.) were not explicitly trained to perform timbre transfer, so they may sound unnatural if the input audio frequencies and loudness are very different from the training data (which will be true most of the time).

Create sliders for model conditioning

#@markdown ## Note Detection
#@markdown You can leave this at 1.0 for most cases
threshold = 1 #@param {type:"slider", min:0.0, max:2.0, step:0.01}
#@markdown ## Automatic
ADJUST = True #@param {type:"boolean"}
#@markdown Quiet parts without notes detected (dB)
quiet = 20 #@param {type:"slider", min:0, max:60, step:1}
#@markdown Force pitch to the nearest note (amount)
autotune = 0 #@param {type:"slider", min:0.0, max:1.0, step:0.1}
#@markdown ## Manual
#@markdown Shift the pitch (octaves)
pitch_shift = 0 #@param {type:"slider", min:-2, max:2, step:1}
#@markdown Adjust the overall loudness (dB)
loudness_shift = 0 #@param {type:"slider", min:-20, max:20, step:1}
audio_features_mod = {k: v.copy() for k, v in audio_features.items()}

The sliders to modify the conditioning appear in the colab as follows:

[Screenshot of the conditioning sliders]
  1. Define a method to shift the loudness
def shift_ld(audio_features, ld_shift=0.0):
  #Increment the loudness by ld_shift
  audio_features['loudness_db'] += ld_shift
  #Return the modified audio features
  return audio_features
  1. Define a method to shift the frequency by a number of octaves
def shift_f0(audio_features, pitch_shift=0.0):
  #Multiply the frequency by 2^pitch_shift
  audio_features['f0_hz'] *= 2.0 ** (pitch_shift)
  #Clip the frequency to a valid range
  audio_features['f0_hz'] = np.clip(audio_features['f0_hz'], 0.0,
                                    librosa.midi_to_hz(110.0))
  return audio_features
  1. Detect the sections of the audio that are ‘on’
if ADJUST and DATASET_STATS is not None:
  #Store the mask and note-on values of the 'on' sections
  mask_on, note_on_value = detect_notes(audio_features['loudness_db'],
                                        audio_features['f0_confidence'],
                                        threshold)
  #Proceed only if at least one 'on' note was detected
  if np.any(mask_on):

Quantile shift the parts with ‘on’ notes

    _, loudness_norm = colab_utils.fit_quantile_transform(
        audio_features['loudness_db'], mask_on,
        inv_quantile=DATASET_STATS['quantile_transform'])

Turn down the parts of the audio with ‘off’ notes.

    #If mask_on is not True for a frame, mark that frame as 'off'
    mask_off = np.logical_not(mask_on)
    #Turn down the loudness of the 'off' frames in the normalized loudness array
    loudness_norm[mask_off] -= quiet * (1.0 - note_on_value[mask_off][:, np.newaxis])
    #Reshape the normalized loudness array
    loudness_norm = np.reshape(loudness_norm, audio_features['loudness_db'].shape)
    #Update the loudness (in dB) to the normalized loudness
    audio_features_mod['loudness_db'] = loudness_norm
    #If 'autotune' is selected using the slider widget
    if autotune:
      #Convert frequency (Hz) to MIDI notes
      f0_midi = np.array(ddsp.core.hz_to_midi(audio_features_mod['f0_hz']))
      #Get an offset in cents to the most consistent set of chromatic intervals
      tuning_factor = get_tuning_factor(f0_midi,
                                        audio_features_mod['f0_confidence'],
                                        mask_on)
      #Reduce the deviation of the frequency from the chromatic or scale intervals
      f0_midi_at = auto_tune(f0_midi, tuning_factor, mask_on, amount=autotune)
      #Store the frequency in Hz by converting the MIDI notes back to Hz
      audio_features_mod['f0_hz'] = ddsp.core.midi_to_hz(f0_midi_at)
  #Display an appropriate message if no notes are detected
  else:
    print('\nSkipping auto-adjust (no notes detected or ADJUST box empty).')
#Display a message if the 'ADJUST' box is not checked or the dataset statistics file is not found
else:
  print('\nSkipping auto-adjust (box not checked or no dataset statistics found).')
  1. Perform manual shifts of the loudness and frequency using the methods defined in steps (20) and (21)
 audio_features_mod = shift_ld(audio_features_mod, loudness_shift)
 audio_features_mod = shift_f0(audio_features_mod, pitch_shift) 
  1. Plot the features
#Check whether a note-on mask exists
has_mask = int(mask_on is not None)
#Three subplots if 'has_mask' is 1 (True), else only 2 subplots of loudness and frequency
n_plots = 3 if has_mask else 2
#Initialize the figure and axes
fig, axes = plt.subplots(nrows=n_plots,
                         ncols=1,
                         sharex=True,
                         figsize=(2*n_plots, 8))
#Plot the mask of 'on' notes, if it exists
if has_mask:
  ax = axes[0]
  ax.plot(np.ones_like(mask_on[:TRIM]) * threshold, 'k:')
  ax.plot(note_on_value[:TRIM])
  ax.plot(mask_on[:TRIM])
  ax.set_ylabel('Note-on Mask')
  ax.set_xlabel('Time step [frame]')
  ax.legend(['Threshold', 'Likelihood', 'Mask'])

#Plot the original and adjusted loudness
ax = axes[0 + has_mask]
ax.plot(audio_features['loudness_db'][:TRIM])
ax.plot(audio_features_mod['loudness_db'][:TRIM])
ax.set_ylabel('loudness_db')
ax.legend(['Original', 'Adjusted'])
#Plot the original and adjusted frequencies
ax = axes[1 + has_mask]
ax.plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax.plot(librosa.hz_to_midi(audio_features_mod['f0_hz'][:TRIM]))
ax.set_ylabel('f0 [midi]')
_ = ax.legend(['Original', 'Adjusted'])

Output:

[Plot of the note-on mask and the original vs. adjusted loudness and f0]
  1. Resynthesize the audio 

Store the computed audio features first

af = audio_features if audio_features_mod is None else audio_features_mod

Run a batch of predictions

#Record the start time
start_time = time.time()
#Apply the model defined in step (17) to the computed audio features
outputs = model(af, training=False)

Extract the audio output from the outputs dictionary

audio_gen = model.get_audio_from_outputs(outputs)

Display the time taken to make the predictions by computing the difference between the current time and the start time

print('Prediction took %.1f seconds' % (time.time() - start_time))
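If you want to keep the resynthesized audio rather than only play it in the notebook, one option (a small addition, not part of the original demo; it uses soundfile, which is available as a librosa dependency) is to write the first item of the generated batch to a .wav file:

import soundfile as sf

# audio_gen has shape [batch, n_samples]; write the first (and only) item to disk
sf.write('resynthesis.wav', np.array(audio_gen)[0], sample_rate)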

  1. Plot the HTML5 widgets for playing the original and resynthesized audio, as well as the spectrograms of both signals
 print('Original')
 play(audio)
 print('Resynthesis')
 play(audio_gen)
 specplot(audio)
 plt.title("Original")
 specplot(audio_gen)
 _ = plt.title("Resynthesis") 

Output widgets:

[HTML5 audio widgets for the original and resynthesized audio]

Output plots:

[Spectrograms of the original and resynthesized audio]

Original audio:

Resynthesized audio (using the ‘Violin’ model):

The Google Colab notebook of the above implementation is available here.
