This idea is inspired by applications such as Photoshop, where a user can
select an object, use a brush tool to pick the object's color,
and transfer that color to another object. With selective semantic attribute transfer,
we would like to select a semantic attribute (say, brightness) from a reference sample
and transfer it to the target sample without changing the other attributes or the time-domain
structure of the original sample.
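For a concrete picture, here is a minimal sketch of one way such a transfer could work, under our assumption (not stated explicitly on this page) that each semantic attribute corresponds to a unit direction d in the generator's W latent space, and that both samples have been inverted to latents: only the target's component along d is replaced with the reference's.

```python
import numpy as np

# Minimal sketch of selective attribute transfer (an assumption of the
# mechanism, not the authors' exact code): `d` is a unit direction in the
# W latent space associated with one semantic attribute.
def transfer_attribute(w_target, w_reference, d):
    d = d / np.linalg.norm(d)
    delta = np.dot(w_reference - w_target, d)  # attribute difference along d
    return w_target + delta * d                # other latent components untouched
```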
Example 1: Transfer Brightness from the Brightness Reference Sample
(row 2, column 1) to the Target Sample (all top-row samples).
Note that only brightness is transferred/changed in the resulting samples. Other attributes,
such as the location of impacts (along the time axis), rate, and impact type,
remain the same as in the Target Sample.
Example 2: Transfer Impact Type from the Impact Type Reference Sample
(row 2, column 1) to the Target Sample (all top-row samples).
Note that only impact type is transferred/changed in the resulting samples. Other attributes,
such as the location of impacts (along the time axis), rate, and brightness,
remain the same as in the Target Sample.
Example 3: Transfer Rate from the Rate Reference Sample
(row 2, column 1) to the Target Sample (all top-row samples).
Note that only rate is transferred/changed in the resulting samples. Other attributes,
such as the location of the first impact (along the time axis), brightness, and impact type,
remain the same as in the Target Sample.
SeFa failure mode for Greatest Hits - Impact Type
Though SeFa Dimension 2 (see Table 3 in the paper) records the highest re-scoring accuracy for the
change in impact type, we can qualitatively
see that this dimension can only change scratches into sharp hits. Sharp hits, on the other hand,
either do not change or go out-of-distribution (OOD)
instead of changing into scratches.
The leftmost samples below are sharp hits. As we move progressively towards the right, the samples
either do not change or go OOD.
SeFa failure mode for Water - fill-level
For the Water dataset, SeFa finds the second dimension (see Table 3 in the paper) to be associated
with fill-level.
The most prominent dimension - i.e., the dimension with the highest singular value (dimension 0)
- is associated with "perpetually filling" the water container, while never reaching fill-level=10
(a full container).
The other dimension (dimension 2) is associated with "perpetually unfilling" the water
container, while never leaving fill-level=10 or reaching fill-level=0.
Whimsical, sure, but not controllable. It makes one wonder, especially for dimension 0, which has
the highest singular value (from the eigendecomposition of the weights): if not fill-level, then what
other factor of variation did SeFa find?
Sample 1 (Dim 0: perpetually filling; no ending.)
Sample 2 (Dim 2: perpetually unfilling; no starting.)
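For context on where these dimensions come from: SeFa ranks directions by the singular values of the generator's style-modulation weights. The sketch below paraphrases that published computation (it is not our code); the weight matrix A is a placeholder to be extracted from a specific StyleGAN2 implementation.

```python
import numpy as np

# Paraphrase of the SeFa direction computation (Shen & Zhou, 2021): semantic
# directions are the right singular vectors of the style-modulation weight
# matrix A, ranked by singular value. How A is gathered from a particular
# StyleGAN2 checkpoint is omitted here.
def sefa_directions(A):
    # A: (out_features, latent_dim) weight matrix
    _, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
    return singular_values, Vt   # Vt[k] is "dimension k" referenced above
```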
Ablation studies for the number of examples needed for guidance - Gaver Sound Synthesis
In this section, we evaluate the effect of the number of Gaver samples (N) used to find the directional
vectors for editing the Brightness and Impact Type attributes.
The first column shows the number of samples used across clusters.
Brightness or Impact Type changes from left to right. As observed, as N increases, the
effectiveness of the directional-vector edits increases.
The edits also preserve the other, un-edited attributes better with higher N.
For instance, for Brightness, for all N<10 the Rate is not preserved.
Also, for N<6, the Impact Type is only partially preserved (some hits become scratches).
Brightness (reduces from left to right)
Impact Type (sounds become scratchier from left to right)
StyleGAN2 Training Details
We set the Z and W space dimensions to 128 for all our
experiments across the two datasets. We also use only 4 mapping layers (compared to 8 in the original StyleGAN2 paper).
Further, we train StyleGAN2 on log-magnitude spectrogram
representations generated using a Gabor
transform (n_frames=256, stft_channels=512, hop_size=128), i.e., a Short-Time Fourier Transform (STFT)
with a Gaussian window, and use Phase Gradient Heap Integration (PGHI) for
high-fidelity spectrogram inversion of the textures back to audio. For the Greatest
Hits dataset, we train the models for 2800 kimgs with a batch size of 16, taking ~20 hours
on a single RTX 2080 Ti GPU with 11GB memory. For the Water Filling dataset, we train for
1400 kimgs with the same batch size, running for ~14 hours on the same GPU. The metrics for the
quality of the generated sounds in terms of Frechet Audio Distance
(FAD), along with the StyleGAN2 code adapted for audio textures, can be
found below.
Table: StyleGAN2 Frechet Audio Distance (FAD)
Dataset | w-dim & z-dim | Number of kimgs (iterations) | FAD Score
Greatest Hits Dataset | 128 | 2800 | 0.51
Water Filling Dataset | 128 | 1400 | 1.87
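For reference, here is a rough sketch of how the log-magnitude representation described above could be computed; the Gaussian window width is our assumption, and the exact Gabor-transform implementation and PGHI inversion used in the paper are not reproduced here.

```python
import numpy as np
import librosa

STFT_CHANNELS = 512   # n_fft
HOP_SIZE = 128
N_FRAMES = 256

def log_mag_spectrogram(audio, eps=1e-6):
    # Gaussian-windowed STFT approximating the Gabor transform; the window
    # std of n_fft/8 is an assumption, not a value from the paper.
    stft = librosa.stft(audio, n_fft=STFT_CHANNELS, hop_length=HOP_SIZE,
                        window=("gaussian", STFT_CHANNELS / 8))
    mag = np.abs(stft)[:, :N_FRAMES]   # keep 256 frames (~2 s at 16 kHz)
    return np.log(mag + eps)           # log-magnitude spectrogram
```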
Encoder/GAN Inversion Training Details
We use a ResNet-34 backbone as the architecture for our GAN inversion
network. For both datasets, we train the Encoder for 1500 iterations with a batch size of 8 and
choose the checkpoint with the lowest validation loss for inference. As in the original GAN
Encoder paper, we use the Adam optimizer with a learning rate of 0.0001.
Further, for the Water dataset, we apply a threshold of -17 dB, i.e., we mask the frequency
components with magnitude below -17 dB. For both datasets, training took ~25 hours to
complete on a single GPU.
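A small sketch of the -17 dB masking step, under our assumption that the threshold is taken relative to the loudest spectrogram bin of each sample:

```python
import numpy as np

def threshold_spectrogram(mag, threshold_db=-17.0):
    # Convert magnitudes to dB relative to the per-sample maximum and zero
    # out components that fall below the threshold.
    mag_db = 20.0 * np.log10(np.maximum(mag, 1e-10))
    mag_db -= mag_db.max()
    return mag * (mag_db >= threshold_db)
```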
Table: GAN Inversion/Encoder (netE) Frechet Audio Distance (FAD)
Dataset | Iterations | Gaver Sounds FAD Score | Real-World Sounds FAD Score
Greatest Hits Dataset | 1500 | 4.73 | 0.586
Water Filling Dataset | 1500 | 7.5 | 3.35
Gaver Sound Synthesis Details
To model impact sounds, such as those in the Greatest Hits dataset, we use a combination of
sounds synthesized using methods 1 and 2. For method 1, we choose a damping constant
δn=0.001 for hard surfaces and δn=0.5 for soft surfaces. We provide variations
in the generated sounds by using different impact surface sizes, φ, and n (the number of
partials). We vary the first partial of ω between 60-240 Hz for large impact surfaces and
between 250-660 Hz for smaller surfaces. For method 2, we vary the impulse width of each impact
between 0.4-1.0 seconds to model scratches and between 0.1-0.4 seconds to model sharp
hits. Further, we model dull sounds by configuring low frequency bands roughly between
10 Hz-1.5 kHz and bright sounds using frequency bands above 4 kHz. Water filling Gaver sounds are
modelled as a concatenation of multiple impulses, each modelled as an individual water drop.
We generate each drop using method 1 with an impulse width of 0.05 seconds.
Each fill-level is controlled by linearly increasing or decreasing ω and its partials
across the sound sample.
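As an illustration of method 1 in the spirit of Gaver's impact model, the sketch below sums exponentially damped sinusoidal partials; the amplitude roll-off and harmonic partial spacing are our assumptions rather than the exact configuration used in our experiments.

```python
import numpy as np

def gaver_impact(f0=120.0, n_partials=8, damping=0.001,
                 duration=0.5, sr=16000):
    # Sum of exponentially damped sinusoids; the damping term grows with
    # partial frequency, so larger damping constants give duller, softer impacts.
    t = np.arange(int(duration * sr)) / sr
    sound = np.zeros_like(t)
    for n in range(1, n_partials + 1):
        freq = n * f0                 # harmonic partials (assumption)
        amp = 1.0 / n                 # 1/n amplitude roll-off (assumption)
        sound += amp * np.exp(-damping * 2 * np.pi * freq * t) \
                     * np.cos(2 * np.pi * freq * t)
    return sound / np.max(np.abs(sound))

# e.g. a hard-surface impact (damping=0.001) vs. a soft-surface impact (damping=0.5)
hard, soft = gaver_impact(damping=0.001), gaver_impact(damping=0.5)
```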
In all our experiments, we use 10 synthetic Gaver examples (5 per semantic attribute cluster) to
generate the guidance vectors for controllable generation. Please see the ablation studies section
of this webpage for guidance vectors generated using different numbers of Gaver samples. Our
code repository has all the Gaver configurations we used in our experiments.
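A minimal sketch of how a guidance vector could be formed from the two attribute clusters, assuming each Gaver example has been encoded to a 128-dimensional W latent; the exact procedure in our code repository may differ.

```python
import numpy as np

def guidance_direction(w_cluster_a, w_cluster_b):
    # w_cluster_a, w_cluster_b: (N, 128) arrays of encoded Gaver latents,
    # e.g. N=5 per semantic attribute cluster as described above.
    direction = w_cluster_b.mean(axis=0) - w_cluster_a.mean(axis=0)
    return direction / np.linalg.norm(direction)

# An edit then moves a sample's latent along this direction:
# w_edited = w + alpha * guidance_direction(w_low, w_high)
```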
Re-scoring Classifier Details
We use a classifier based on
this paper.
The classifier is a DenseNet-based network (with pre-training), where the last layer is modified
depending on the number of classes we need. For binary re-scoring analysis the number of
classes is 2 (to indicate the presence or absence of the semantic attribute).
The input to the classifier is 3 mel-spectrograms stacked along the channel axis. As in the
original paper, we follow this method to capture information at 3 different time scales, i.e., we
compute the mel-spectrogram of a signal using window sizes and hop-lengths of [25ms,
10ms], [50ms, 25ms], and [100ms, 50ms] for each channel, respectively. The different window sizes
and hop-lengths ensure the network has different levels of frequency- and time-domain information
on each channel. We train with the Adam optimizer using a learning rate of 0.0001 and a weight
decay of 0.001, for 100 epochs.
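A sketch of the 3-channel multi-scale mel-spectrogram input described above; the sample rate, mel-band count, and frame alignment are our assumptions.

```python
import numpy as np
import librosa

SR = 16000                                     # assumed sample rate
SCALES_MS = [(25, 10), (50, 25), (100, 50)]    # (window, hop) in milliseconds

def multiscale_melspec(audio, n_mels=64, n_frames=200):
    channels = []
    for win_ms, hop_ms in SCALES_MS:
        mel = librosa.feature.melspectrogram(
            y=audio, sr=SR, n_mels=n_mels,
            n_fft=int(SR * win_ms / 1000),
            hop_length=int(SR * hop_ms / 1000))
        mel = librosa.power_to_db(mel)
        # crop/pad each scale to a common frame count so the channels align
        if mel.shape[1] >= n_frames:
            mel = mel[:, :n_frames]
        else:
            mel = np.pad(mel, ((0, 0), (0, n_frames - mel.shape[1])))
        channels.append(mel)
    return np.stack(channels, axis=0)          # shape: (3, n_mels, n_frames)
```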
Curated Dataset for Training Re-Scoring Classifier
To train the attribute re-scoring classifier, we manually curate and label a small dataset from
both the Greatest Hits and Water Filling datasets.
We use this classifier to quantitatively evaluate the
effectiveness of our method in comparison with SeFa. For this, we manually curate approximately
250 two-second sound samples for each semantic attribute from both datasets. This manual curation
involved visually analysing the video and auditioning the associated sound files to detect the
semantic attribute being curated. For Perceived Brightness, we curated a set
of bright sounds from hits made on dense material surfaces such as glass, tile, ceramic, or metal
to indicate the presence of the brightness attribute, and a set of dull or dark sounds made by
impacts on soft materials such as cloth, carpet, or paper to indicate its absence.
For Rate, we curated sound samples with just 1-2 impacts per sample to indicate low rate, and
labelled all other samples as high rate. For
Impact Type, we curated a set of sound samples where the drumstick sharply hit
the surface and another set where the drumstick scratched the surface. For
Fill-Level for Water, we curated the sounds by sampling the first and last ~3
seconds of each file in the original dataset (files are ~30 seconds long) to indicate
an empty bucket and a full bucket, respectively. These datasets are used to train the attribute-change
(re-scoring) classifiers during evaluation.