The EMBLIG data set is a compilation of entries selected from the EMBL database.
File | EMBL Release | Date | Link |
---|---|---|---|
emblig_latest.tar.gz | 20230625 | 22.01.2024 | Download |
File | EMBL Release | Date | Link |
---|---|---|---|
emblig_20240122_r20230625.tar.gz | 20230625 | 22.01.2024 | Download |
emblig_20220723_r20220531.tar.gz | 20220531 | 23.07.2022 | Download |
emblig_20211101_r11Oct2021.tar.gz | 11Oct2021 | 01.11.2021 | Download |
emblig_20210331_r143.tar.gz | 143 | 31.03.2021 | Download |
emblig_20200211_r142.tar.gz | 142 | 11.02.2020 | Download |
emblig_20190405_r138.tar.gz | 138 | 05.04.2019 | Download |
emblig_20180710_r136.tar.gz | 136 | 10.07.2018 | Download |
emblig_20180118_r134.tar.gz | 134 | 18.01.2018 | Download |
emblig_20171017_r133.tar.gz | 133 | 17.10.2017 | Download |
emblig_20170223_r130.tar.gz | 130 | 23.02.2017 | Download |
emblig_20161013_r129.tar.gz | 129 | 13.10.2016 | Download |
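The archive file names in the table appear to follow the pattern `emblig_<build date>_r<EMBL release>.tar.gz`, with the build date written as YYYYMMDD. That pattern is inferred from the listing rather than documented, so the following Python sketch is only an illustration; the function name `parse_archive_name` is hypothetical.

```python
import re
from datetime import date, datetime
from typing import Tuple

# Pattern inferred from the archive table (an assumption, not a documented rule):
# emblig_<YYYYMMDD>_r<release label>.tar.gz
ARCHIVE_NAME = re.compile(r"^emblig_(\d{8})_r(\w+)\.tar\.gz$")


def parse_archive_name(filename: str) -> Tuple[date, str]:
    """Return (build date, EMBL release label) for an archive file name."""
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised archive name: {filename}")
    build_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    return build_date, match.group(2)
```

The release label is kept as a string because the releases are labelled inconsistently (for example `143`, `11Oct2021` and `20230625`).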
The aim is to capture every sequence in the public domain that contains an antibody light- or heavy-chain variable domain, using a simple similarity protocol.
To avoid experimentally unverified sequences, pseudogenes, etc., the selection is restricted to certain data classes and taxonomic divisions and requires that each entry has a protein translation.
Every protein translation downloaded from the EMBL release directory is used as a query sequence in a conservative BLAST search against three reference sets of known Ig variable domain sequences.
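As a rough illustration of such a restriction, the Python sketch below checks a single EMBL flat-file entry. It assumes the current EMBL `ID` line layout (accession; version; topology; molecule type; data class; taxonomic division; length) and a `/translation=` qualifier in the feature table. The accepted data classes and divisions are not stated here, so the two allow-lists are placeholders, and the function name `passes_filter` is hypothetical.

```python
# Placeholder allow-lists: the accepted values are NOT given in this document.
ALLOWED_CLASSES = {"STD"}
ALLOWED_DIVISIONS = {"HUM", "MUS", "ROD", "MAM", "VRT"}


def passes_filter(entry_text: str) -> bool:
    """Check one flat-file entry for data class, division and a translation."""
    lines = entry_text.splitlines()
    id_line = next((ln for ln in lines if ln.startswith("ID ")), None)
    if id_line is None:
        return False
    # Assumed ID line fields, separated by ';':
    # accession; SV n; topology; molecule type; data class; division; length
    fields = [f.strip() for f in id_line[2:].split(";")]
    if len(fields) < 7:
        return False
    data_class, division = fields[4], fields[5]
    # A protein translation appears as a /translation= qualifier on an FT line.
    has_translation = any(
        ln.startswith("FT") and "/translation=" in ln for ln in lines
    )
    return (
        data_class in ALLOWED_CLASSES
        and division in ALLOWED_DIVISIONS
        and has_translation
    )
```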
The reference sets are:
The overwhelming majority of queries (95%) that match one reference set also match all three.
Entries with a translation that gets a match are collected, and the protein identifiers of the matching translations are recorded.
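The collection step described above can be sketched in Python as follows. This is a minimal illustration, not the actual pipeline: it assumes the BLAST results for each reference set are available in the standard 12-column tabular format (query identifier in column 1, E-value in column 11), and the cut-off `max_evalue` and both function names are hypothetical, since the search parameters are not given here.

```python
from typing import Dict, Iterable, Set, Tuple


def matched_queries(rows: Iterable[str], max_evalue: float) -> Set[str]:
    """Return query identifiers with at least one hit at or below max_evalue."""
    matched = set()
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        if float(fields[10]) <= max_evalue:  # column 11 holds the E-value
            matched.add(fields[0])           # column 1 holds the query id
    return matched


def summarise(hits_by_set: Dict[str, Set[str]]) -> Tuple[Set[str], float]:
    """Return (queries matching any set, fraction of those matching every set)."""
    if not hits_by_set:
        return set(), 0.0
    any_set = set().union(*hits_by_set.values())
    every_set = set.intersection(*hits_by_set.values())
    fraction = len(every_set) / len(any_set) if any_set else 0.0
    return any_set, fraction
```

The first value returned by `summarise` corresponds to the protein identifiers that would be recorded; the second gives the share of matched queries that hit every reference set, which is the kind of figure quoted above.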