Real-world applications

Notes:

Summary:

  1. Total meme of a paper. Included very neat pretraining and data preparation steps, though.

  2. Focused on training an img2text model on a task with no real dataset; the paper details the steps they took to create one.

  3. Weakly supervised model.

Medium level:

Figure 1. Example outputs

The goal of this paper was to generate critical textual feedback for photographs of many different modalities (but all lying within ‘artistic photography’). There are datasets for similar purposes (namely images paired with comments); however, the vast majority of these comments aren’t critical, or are grammatically incorrect. The authors seem to delete or automagically correct these, but it’s not exactly clear in the paper.

The authors decided to score the captions for a given image based on how ‘informative’ each caption is. They first broke the vocabulary into unigrams and bigrams; bigrams consist of ‘descriptor-object’ pairs such as ‘nice colors’, ‘too small’, ‘distracting background’, etc. They then looked at the TF-IDF of words in the vocabulary.
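A minimal sketch of that n-gram split, in Python, assuming a plain regex tokenizer and a small hand-picked descriptor list as a stand-in for the paper’s descriptor-object pairing (the paper doesn’t spell out how those pairs are detected):

```python
import re
from typing import List, Tuple

# Hypothetical descriptor list; the paper's actual descriptor-object
# detection procedure is not described in these notes.
DESCRIPTORS = {"nice", "too", "distracting", "great", "soft", "harsh"}

def extract_ngrams(comment: str) -> Tuple[List[str], List[str]]:
    """Split a comment into unigrams and descriptor-object bigrams."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    unigrams = tokens
    bigrams = [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a in DESCRIPTORS  # keep pairs that look like 'nice colors'
    ]
    return unigrams, bigrams

print(extract_ngrams("Nice colors, but the background is too distracting."))
```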

Each n-gram \(\omega\) is then assigned a probability \(P(\omega)\):

\[P(\omega) = \frac{C_\omega}{\sum^D_{i=1} C_i}\]

where \(D\) is the vocabulary size and \(C_\omega\) is the corpus frequency of the n-gram \(\omega\).
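Estimating \(P(\omega)\) is then just normalised corpus counting. A sketch, continuing the hypothetical `extract_ngrams` helper above:

```python
from collections import Counter

def ngram_probabilities(comments):
    """P(w) = C_w / sum_i C_i, i.e. relative corpus frequency over all n-grams."""
    counts = Counter()
    for comment in comments:
        unigrams, bigrams = extract_ngrams(comment)
        counts.update(unigrams)
        counts.update(bigrams)
    total = sum(counts.values())
    return {ngram: c / total for ngram, c in counts.items()}
```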

The authors then represent a comment as the union of the unigrams (\(u_i\)) and bigrams (\(b_i\)), with a given sequence \(S = (u_1, ..., u_N) \cup (b_1, ..., b_M) = S_u \cup S_b\). A comment is assigned an informativeness score, \(\rho\), as:

\[\rho_s = -\frac{1}{2} \bigg[ \log \prod^N_i P(u_i) + \log \prod^M_j P(b_j) \bigg]\]

This is just the average of the negative log probabilities of \(S_u\) and \(S_b\).

The score of a comment is computed under the assumption that all n-grams are independent. As such, if the n-grams in a sentence have high corpus probabilities (i.e. they are common, uninformative words), the corresponding \(\rho\) score is low, due to the negative logarithm, and vice versa.

\(\rho\) will be higher for longer comments than for shorter ones; however, long comments without ‘informative’ words are still discarded.

Comments below a threshold of \(\rho = 20\) were deleted, which removed roughly 55% of the \(\sim 3\)m-comment corpus. Even with these comments in hand, the CNN still cannot be trained efficiently: there are \(\sim\)25k n-grams, many of them redundant. The authors therefore cluster semantically similar n-grams, such as ‘face’ and ‘ear’, or ‘sky’ and ‘cloud’.
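Putting the pieces together, a sketch of the scoring and filtering step; the \(\rho \geq 20\) cutoff is from the paper, while `extract_ngrams` and `ngram_probabilities` are the hypothetical helpers from the earlier sketches:

```python
import math

def informativeness(comment, probs):
    """rho = -0.5 * (log prod P(u_i) + log prod P(b_j)),
    treating all n-grams as independent."""
    unigrams, bigrams = extract_ngrams(comment)
    # Skip unseen n-grams for robustness; in a faithful implementation every
    # n-gram would already be in the corpus vocabulary.
    log_pu = sum(math.log(probs[u]) for u in unigrams if u in probs)
    log_pb = sum(math.log(probs[b]) for b in bigrams if b in probs)
    return -0.5 * (log_pu + log_pb)

def filter_corpus(comments, probs, threshold=20.0):
    """Keep only comments whose informativeness score clears the threshold."""
    return [c for c in comments if informativeness(c, probs) >= threshold]
```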

The authors then use a technique called latent Dirichlet allocation.

Latent Dirichlet Allocation (LDA):

Given a set of documents, \(\mathcal{D} = \{ D_1, ..., D_N \}\), and a vocabulary of words, \(\mathcal{W} = \{ w_1, ..., w_M \}\), the task is to infer \(K\) latent topics \(\mathcal{T} = \{ T_1, ..., T_K \}\), where each topic can be seen as a collection of words, and each document can be seen as a collection of topics. This is usually done via a variational approximation.
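A minimal sketch of this step using scikit-learn’s variational LDA implementation; \(K = 200\) comes from the paper, but the vectoriser settings and everything else here are illustrative assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def fit_topics(comments, n_topics=200):
    """Fit LDA over the filtered comments, treating each comment as a bag of n-grams."""
    # Unigrams and bigrams together, mirroring the vocabulary described above;
    # min_df is an illustrative choice, not from the paper.
    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=5)
    counts = vectorizer.fit_transform(comments)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-comment topic distribution
    return lda, vectorizer, doc_topics
```

The per-comment topic distribution (or its argmax) can then serve as the label set for training the CNN.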

The authors set \(K = 200\), then used these topics as labels for the CNN. Once the CNN was trained, they simply used it as a feature extractor for the LSTM, which took the extracted features as input and tried to predict the ground-truth caption.
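A rough PyTorch sketch of that pipeline; the notes don’t say which CNN or decoder the authors used, so a ResNet-50 backbone, a single-layer LSTM, and all layer sizes here are stand-in assumptions:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    """CNN trained on the K topic labels, then frozen and reused as a feature
    extractor whose features condition an LSTM caption decoder."""

    def __init__(self, vocab_size, n_topics=200, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features            # 2048 for ResNet-50
        backbone.fc = nn.Linear(feat_dim, n_topics)   # topic-classification head
        self.cnn = backbone
        for p in self.cnn.parameters():               # freeze after topic training
            p.requires_grad = False
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def features(self, images):
        # Everything up to (but excluding) the topic-classification head.
        x = nn.Sequential(*list(self.cnn.children())[:-1])(images)
        return x.flatten(1)                           # (B, feat_dim)

    def forward(self, images, captions):
        img = self.img_proj(self.features(images)).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                              # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                                   # next-word logits
```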
