Databases API
The API for the different linguistic databases can be accessed through
lingtypology.datasets.
import lingtypology.datasets
1. General
Lingtypology attempts to provide a unified API for the supported language databases, so the classes in this module share some common attributes and methods. This section describes them and gives examples for Autotyp, Wals and Phoible.
from lingtypology.datasets import Autotyp, Wals, Phoible
1.1. features_list
You can get the list of features available in a database using this attribute.
Autotyp().features_list[:10]  # Truncated so that it does not take too much space
['Agreement',
'Alienability',
'Alignment',
'Alignment_case_splits',
'Alignment_per_language',
'Clause_linkage',
'Clause_word_order',
'Clusivity',
'GR_per_language',
'Gender']
Note: Phoible has no features_list attribute because it has
no features. Instead, it has subsets_list, which lists the
available subsets of Phoible data.
Phoible().subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']
1.2. get_df and get_json
These two methods access the database and return the data as a
pandas.DataFrame (get_df) or a dict (get_json). Example of usage:
Autotyp('Agreement', 'Clusivity').get_df().head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
| | language | LID | VPolyagreement.Presence.v2 | VPolyagreement.Presence.v1 | InclExclAsPerson.Presence | InclExclAny.Presence | InclExclType | InclExclAsMinAug.Presence |
|---|---|---|---|---|---|---|---|---|
| 0 | Ambulas | 6 | False | False | False | False | no i/e | False |
| 1 | Abkhazian | 7 | True | True | False | False | no i/e | False |
| 2 | Acehnese | 9 | True | False | False | True | plain i/e type | False |
| 3 | Western Keres | 10 | True | True | False | False | no i/e | False |
| 4 | Hokkaido Ainu | 12 | True | True | False | True | plain i/e type | False |
For Phoible and Autotyp you can use the strip_na
parameter (list, default: []) to drop rows that have an
empty cell in any of the given columns. Compare the following.
Without strip_na (empty cells are replaced with '~N/A~'):
Phoible().get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
| 1 | KOREAN (UPSID 423) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 32 | 21 | 11 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2170.html | https://phoible.org/languages/kore1280 |
| 2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
| 3 | KET (UPSID 399) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 25 | 18 | 7 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2706.html | https://phoible.org/languages/kett1243 |
| 4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
With the tones column given to strip_na:
Phoible().get_df(strip_na=['tones']).head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
| 2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
| 4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
| 6 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
| 8 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
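The effect of strip_na can be pictured with a small, library-free sketch. It assumes, based only on the two tables above, that a row is dropped when any listed column holds the placeholder '~N/A~' and that the surviving rows keep their original index. The helper strip_na_rows and the abridged sample rows are hypothetical and are not part of lingtypology.

```python
NA = '~N/A~'  # placeholder shown in the tables above for empty cells


def strip_na_rows(rows, columns):
    """Return (index, row) pairs whose listed columns are all filled in.

    Hypothetical helper, not part of lingtypology: it only mimics the
    behaviour visible in the tables above.
    """
    return [
        (index, row)
        for index, row in enumerate(rows)
        if all(row[column] != NA for column in columns)
    ]


# A few rows abridged from the Phoible output above
rows = [
    {'contribution_name': 'Korean (SPA 1)', 'tones': '0'},
    {'contribution_name': 'KOREAN (UPSID 423)', 'tones': NA},
    {'contribution_name': 'Ket (SPA 2)', 'tones': '0'},
    {'contribution_name': 'KET (UPSID 399)', 'tones': NA},
]

kept = strip_na_rows(rows, ['tones'])
print([index for index, _ in kept])
```

With an empty column list nothing is filtered, which matches the documented default of strip_na=[].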
Note: By default, calling get_df or get_json prints
the citation. To disable this, set the
show_citation attribute to False.
p = Phoible()
p.show_citation = False
p.get_df(strip_na=['tones']).head()
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
| 2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
| 4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
| 6 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
| 8 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
1.3. citation
You can get the citation for each database using the citation attribute.
For example:
from lingtypology.datasets import Autotyp
print(Autotyp().citation)
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
Note: if you use Wals, a citation is shown for every feature.
If you want the general citation for the whole of Wals, use
general_citation.
w = Wals('1a', '2a')
print(w.citation)
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)
Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)
print(w.general_citation)
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013.
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info, Accessed on 2019-06-13.)
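Because the classes share show_citation, get_df and citation, a helper can treat any of them in the same way. The sketch below is only an illustration: fetch_quietly is a hypothetical helper, and FakeDataset is a stand-in that imitates the shared attributes so that the snippet runs without network access (it returns a plain list rather than a pandas.DataFrame). With lingtypology installed you would pass, for example, Phoible() instead.

```python
def fetch_quietly(dataset, **kwargs):
    """Fetch data without printing the citation; return (data, citation).

    Hypothetical helper: relies only on the show_citation, get_df and
    citation members described above.
    """
    dataset.show_citation = False
    data = dataset.get_df(**kwargs)
    return data, dataset.citation


class FakeDataset:
    """Offline stand-in imitating the shared interface (not a real class)."""

    citation = 'Example citation'
    show_citation = True

    def get_df(self, **kwargs):
        # Mimic the documented behaviour: print the citation unless disabled
        if self.show_citation:
            print(self.citation)
        return [{'language': 'Korean'}]


data, citation = fetch_quietly(FakeDataset())
print(citation)
```

Returning the citation alongside the data keeps it available for later use even though it is no longer printed.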
2. Wals
It is possible to access Wals data (online) using
lingtypology.datasets.Wals.
from lingtypology.datasets import Wals
wals_page = Wals('1a', '2a').get_df()
wals_page.head()
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)
Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)
| | wals_code | language | genus | family | coordinates | _1A_area | _1A | _1A_num | _1A_desc | _2A_area | _2A | _2A_num | _2A_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kiw | Kiwai (Southern) | Kiwaian | Kiwaian | (-8.0, 143.5) | Phonology | 1. Small | 1 | Small | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
| 1 | xoo | !Xóõ | Tu | Tu | (-24.0, 21.5) | Phonology | 5. Large | 5 | Large | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
| 2 | ani | //Ani | Khoe-Kwadi | Khoe-Kwadi | (-18.9166666667, 21.9166666667) | Phonology | 5. Large | 5 | Large | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
| 3 | abi | Abipón | South Guaicuruan | Guaicuruan | (-29.0, -61.0) | Phonology | 2. Moderately small | 2 | Moderately small | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
| 4 | abk | Abkhaz | Northwest Caucasian | Northwest Caucasian | (43.0833333333, 41.0) | Phonology | 5. Large | 5 | Large | Phonology | 1. Small (2-4) | 1 | Small (2-4) |
Map example for feature 1A:
m = lingtypology.LingMap(wals_page.language)
m.add_custom_coordinates(wals_page.coordinates)
m.add_features(
wals_page._1A,
colors=lingtypology.gradient(5, 'yellow', 'green')
)
m.legend_title = 'Consonant Inventory'
m.create_map()
3. Autotyp
It is possible to access Autotyp data (online) using
lingtypology.datasets.Autotyp.
Unlike in Wals, each table name passed to Autotyp adds several
columns to the result:
Autotyp_table = Autotyp('Gender', 'Agreement').get_df(strip_na=['Gender.binned4'])
Autotyp_table.head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
| | language | LID | Gender.n | Gender.binned4 | Gender.Presence | VPolyagreement.Presence.v2 | VPolyagreement.Presence.v1 |
|---|---|---|---|---|---|---|---|
| 0 | Godoberi | 1531 | 3 | 3 genders | True | False | False |
| 1 | Bininj Kun-Wok | 655 | 4 | 4 genders | True | True | True |
| 2 | Luvale | 553 | 10 | more than 4 genders | True | True | False |
| 3 | North-Central Dargwa | 2949 | 3 | 3 genders | True | True | True |
| 4 | Gaagudju | 82 | 4 | 4 genders | True | True | True |
Now we can draw a map of the gender data for these languages.
m = lingtypology.LingMap(Autotyp_table.language)
m.add_features(
Autotyp_table['Gender.binned4'],
colors=lingtypology.gradient(4, color1='yellow', color2='red')
)
m.legend_title = 'Genders'
m.create_map()
4. AfBo
from lingtypology.datasets import AfBo
adj = AfBo('adjectivizer').get_df()
adj.head()
Seifart, Frank. 2013.
AfBo: A world-wide survey of affix borrowing.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://afbo.info, Accessed on 2019-06-13.)
| | language_recipient | language_donor | reliability | adjectivizer |
|---|---|---|---|---|
| 0 | Resígaro | Bora | high | 0 |
| 1 | Gurindji Kriol | Gurindji | high | 0 |
| 2 | Copper Island Aleut | Russian | high | 0 |
| 3 | Sakha | Mongolian | high | 4 |
| 4 | Kalderash Romani | Romanian | high | 1 |
m = lingtypology.LingMap(adj.language_recipient)
m.add_features(adj['adjectivizer'], numeric=True)
m.legend_title = 'Adj'
m.create_map()
5. SAILS
from lingtypology.datasets import Sails
To get a pandas.DataFrame of features and descriptions:
Sails().features_descriptions.head()
| | Feature | Description |
|---|---|---|
| 0 | ICU17 | Is plurality in independent pronouns expressed... |
| 1 | ICU16 | Is plurality in independent pronouns expressed... |
| 2 | ICU15 | Is plurality in independent pronouns expressed... |
| 3 | ICU14 | Is an associative or collective plural disting... |
| 4 | ICU13 | Are nouns denoting inanimates marked for plural? |
To get the descriptions of particular features:
Sails().feature_descriptions('ICU10', 'ICU11')
| | Feature | Description |
|---|---|---|
| 0 | ICU10 | Is nominal plural marking obligatory? |
| 1 | ICU11 | Are nouns denoting humans marked for plural? |
To get the SAILS data as a dict, use the get_json method. To
get the data as a pandas.DataFrame, run:
sails = Sails('ICU3', 'ICU4')
df = sails.get_df()
df.head()
You probably should cite it, but I don't understand how. Please, consult https://sails.clld.org/
| | language | coordinates | ICU3 | ICU3_desc | ICU4 | ICU4_desc |
|---|---|---|---|---|---|---|
| 0 | Baniva | (5.26123, -67.56326999999999) | 1 | Yes | 0 | No |
| 1 | Apolista | (-14.83, -68.66) | 0 | No | ? | ? |
| 2 | Yavitero | (2.800281, -68.08421899999999) | 1 | Yes | 0 | No |
| 3 | Resígaro | (-2.48139, -71.35778) | 0 | No | 0 | No |
| 4 | Tol | (14.66859, -87.03719) | 0 | No | 0 | No |
Map example:
m = lingtypology.LingMap(df.language)
m.add_features(df.ICU3_desc)
m.legend_title = sails.feature_descriptions('ICU3').Description.at[0]
m.start_location = (9, -79)
m.start_zoom = 5
m.legend_position = 'bottomleft'
m.create_map()
6. Phoible
from lingtypology.datasets import Phoible
Unlike with the other databases, you do not pass features to Phoible; you pass a subset instead. Take a look:
p = Phoible()
p.get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
| 1 | KOREAN (UPSID 423) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 32 | 21 | 11 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2170.html | https://phoible.org/languages/kore1280 |
| 2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
| 3 | KET (UPSID 399) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 25 | 18 | 7 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2706.html | https://phoible.org/languages/kett1243 |
| 4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
Some languages have several entries because Phoible data consists of several different subsets. You can get the list of available subsets:
p.subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']
… and pass them into the class:
p = Phoible(subset='SPA')
df = p.get_df(strip_na=['tones'])
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
| 1 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
| 2 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
| 3 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
| 4 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
You can also get non-aggregated data by setting aggregated to
False when initializing the class.
Phoible(aggregated=False).get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | InventoryID | Glottocode | ISO6393 | LanguageName | SpecificDialect | GlyphID | Phoneme | Allophones | Marginal | SegmentClass | ... | retractedTongueRoot | advancedTongueRoot | periodicGlottalSource | epilaryngealSource | spreadGlottis | constrictedGlottis | fortis | raisedLarynxEjective | loweredLarynxImplosive | click |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | kore1280 | kor | Korean | ~N/A~ | 0061 | a | a | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
| 1 | 1 | kore1280 | kor | Korean | ~N/A~ | 0061+02D0 | aː | aː | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
| 2 | 1 | kore1280 | kor | Korean | ~N/A~ | 00E6 | æ | ɛ æ | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
| 3 | 1 | kore1280 | kor | Korean | ~N/A~ | 00E6+02D0 | æː | æː | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
| 4 | 1 | kore1280 | kor | Korean | ~N/A~ | 0065 | e | e | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
5 rows × 48 columns
Map example:
m = lingtypology.LingMap(df.language)
m.colormap_colors = ('white', 'red')
m.add_features(df.tones, numeric=True)
m.legend_title = 'Tones'
m.legend_position = 'bottomleft'
m.create_map()
Another example (slow due to the large amount of data):
df = Phoible(subset='UPSID', aggregated=False).get_df()
# Keep only the rows (segments) that are ejectives
df = df[df.raisedLarynxEjective == '+']
# Keep one row per language
df = df.drop_duplicates(subset='Glottocode')
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | InventoryID | Glottocode | ISO6393 | LanguageName | SpecificDialect | GlyphID | Phoneme | Allophones | Marginal | SegmentClass | ... | retractedTongueRoot | advancedTongueRoot | periodicGlottalSource | epilaryngealSource | spreadGlottis | constrictedGlottis | fortis | raisedLarynxEjective | loweredLarynxImplosive | click |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7570 | 198 | afad1236 | aal | KOTOKO | ~N/A~ | 0063+02BC | cʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
| 7802 | 206 | ahte1237 | aht | AHTNA | ~N/A~ | 006B+02BC | kʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
| 7920 | 211 | qawa1238 | alc | QAWASQAR | ~N/A~ | 006B+02BC | kʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
| 8131 | 218 | hame1242 | amf | HAMER | ~N/A~ | 0071+02BC | qʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
| 8157 | 219 | amha1245 | amh | AMHARIC | ~N/A~ | 006B+02B7+02BC | kʷʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
5 rows × 48 columns
m = lingtypology.LingMap(df.Glottocode, glottocode=True)
m.title = 'Languages with Ejectives'
m.tiles = 'Stamen Terrain'
m.radius = 5
m.opacity = 0.5
m.colors = ('blue',)
m.create_map()
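The filtering and de-duplication steps in this last example reduce the segment-level table to one row per language. A library-free sketch of the same logic is given below; the function name, the glottocode-like identifiers and the sample rows are made up for illustration, and the pandas calls above remain the actual way to do it.

```python
def first_row_per_language(rows, feature, value='+'):
    """Keep the first row per Glottocode whose `feature` equals `value`."""
    seen = set()
    result = []
    for row in rows:
        if row[feature] != value:
            continue  # segment does not have the feature
        if row['Glottocode'] in seen:
            continue  # language is already represented
        seen.add(row['Glottocode'])
        result.append(row)
    return result


# Made-up sample rows; only the columns needed here are included
rows = [
    {'Glottocode': 'lang0001', 'Phoneme': 'cʼ', 'raisedLarynxEjective': '+'},
    {'Glottocode': 'lang0001', 'Phoneme': 'kʼ', 'raisedLarynxEjective': '+'},
    {'Glottocode': 'lang0001', 'Phoneme': 'a', 'raisedLarynxEjective': '-'},
    {'Glottocode': 'lang0002', 'Phoneme': 'kʼ', 'raisedLarynxEjective': '+'},
]

ejective_languages = first_row_per_language(rows, 'raisedLarynxEjective')
print([row['Glottocode'] for row in ejective_languages])
```

Like drop_duplicates with its default settings, the sketch keeps the first matching row for each language, so the phoneme shown per language is simply the first ejective encountered.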