Fine-grained classification is a challenging problem in which the objective is to distinguish between classes that are very similar to each other. Because triplet loss models relations between samples, it is well suited to fine-grained settings. In this work, we propose an adaptive three-level hierarchy of samples and exploit it via multi-level triplet contrast. Positives and negatives are sampled from a queue, which gives greater control over the variety and computational cost of sampling. We exploit cross-modal information by using a Universal Sentence Encoder to find similar categories and group them together. In addition, we use K-means to dynamically find subclasses within fine-grained categories. Experiments show that the proposed method significantly improves accuracy on two popular fine-grained classification benchmarks, including a gain of +0.58 on CUB-200-2011.
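
To make the idea of multi-level triplet contrast with queue-based sampling more concrete, the following is a minimal Python sketch and not the implementation used in this work. It assumes that each sample carries a three-level label tuple (group, category, subclass), that a positive at a given level shares the anchor's labels down to that level while a negative does not, and that each level uses the standard margin-based triplet loss with its own margin. The names `SampleQueue`, `multi_level_triplet_loss`, and `margins` are hypothetical and chosen for illustration only.

```python
from collections import deque
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


class SampleQueue:
    """Fixed-size FIFO of (embedding, labels) pairs (hypothetical helper).

    When the queue is full, pushing a new pair evicts the oldest one.
    """

    def __init__(self, maxlen):
        self.items = deque(maxlen=maxlen)

    def __len__(self):
        return len(self.items)

    def push(self, embedding, labels):
        self.items.append((embedding, labels))

    def sample(self, keep, rng):
        # Return one random embedding whose labels satisfy `keep`, or None.
        candidates = [emb for emb, labels in self.items if keep(labels)]
        return rng.choice(candidates) if candidates else None


def multi_level_triplet_loss(anchor, anchor_labels, queue, margins, rng=random):
    """Sum one triplet loss per hierarchy level (assumed formulation).

    Level 0 compares groups, level 1 categories, level 2 subclasses.
    Levels with no available positive or negative in the queue are skipped.
    """
    total = 0.0
    for level, margin in enumerate(margins):
        prefix = anchor_labels[: level + 1]
        positive = queue.sample(lambda l: l[: level + 1] == prefix, rng)
        negative = queue.sample(lambda l: l[: level + 1] != prefix, rng)
        if positive is None or negative is None:
            continue
        total += triplet_loss(anchor, positive, negative, margin)
    return total
```

In this sketch the queue length bounds both the cost of sampling and the variety of candidates, which is the trade-off the queue is meant to control. How the group and subclass labels are produced (for example from sentence-embedding similarity and K-means) is outside the scope of the sketch.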