Databases API
The API for the different linguistic databases can be accessed through lingtypology.datasets.
import lingtypology.datasets
1. General
Lingtypology attempts to provide a unified API for the supported language databases, so the classes in this module share some common attributes and methods. This section describes them and provides examples for Autotyp, Wals and Phoible.
from lingtypology.datasets import Autotyp, Wals, Phoible
1.1. features_list
You can get the list of available features from the database using this attribute.
Autotyp().features_list[:10]  # Truncated so that it does not take too much space
['Agreement',
'Alienability',
'Alignment',
'Alignment_case_splits',
'Alignment_per_language',
'Clause_linkage',
'Clause_word_order',
'Clusivity',
'GR_per_language',
'Gender']
Note: Phoible has no features_list attribute because it has no features. However, it has a subsets_list attribute that lists the available subsets of Phoible data.
Phoible().subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']
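Since the listing attribute differs between classes, generic code may want to look it up without caring which class it received. The following is a minimal sketch that assumes only what is stated above (features_list on most classes, subsets_list on Phoible). The two stand-in classes and the helper name available_options are hypothetical and exist only so that the snippet runs without network access; they are not part of lingtypology.

```python
class FakeAutotyp:
    """Hypothetical stand-in for a dataset class that exposes features_list."""
    features_list = ['Agreement', 'Alienability', 'Alignment']


class FakePhoible:
    """Hypothetical stand-in for a dataset class that exposes subsets_list."""
    subsets_list = ['all', 'UPSID', 'SPA']


def available_options(dataset):
    """Return features_list if the object has it, otherwise subsets_list."""
    for attribute in ('features_list', 'subsets_list'):
        if hasattr(dataset, attribute):
            return list(getattr(dataset, attribute))
    raise AttributeError('object has neither features_list nor subsets_list')


print(available_options(FakeAutotyp()))
print(available_options(FakePhoible()))
```

With the real classes, the same helper could presumably be called as available_options(Autotyp()) or available_options(Phoible()).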
1.2. get_df and get_json
These two methods access the database and return the data as a pandas.DataFrame or a dict, respectively. Example of usage:
Autotyp('Agreement', 'Clusivity').get_df().head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
| | language | LID | VPolyagreement.Presence.v2 | VPolyagreement.Presence.v1 | InclExclAsPerson.Presence | InclExclAny.Presence | InclExclType | InclExclAsMinAug.Presence |
---|---|---|---|---|---|---|---|---|
0 | Ambulas | 6 | False | False | False | False | no i/e | False |
1 | Abkhazian | 7 | True | True | False | False | no i/e | False |
2 | Acehnese | 9 | True | False | False | True | plain i/e type | False |
3 | Western Keres | 10 | True | True | False | False | no i/e | False |
4 | Hokkaido Ainu | 12 | True | True | False | True | plain i/e type | False |
For Phoible and Autotyp, you can use the strip_na parameter (list, default: []) to strip rows that have an empty cell in the given columns. Compare the following.
Without strip_na (empty cells are replaced with '~N/A~'):
Phoible().get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
1 | KOREAN (UPSID 423) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 32 | 21 | 11 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2170.html | https://phoible.org/languages/kore1280 |
2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
3 | KET (UPSID 399) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 25 | 18 | 7 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2706.html | https://phoible.org/languages/kett1243 |
4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
With the tones column given to strip_na:
Phoible().get_df(strip_na=['tones']).head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
6 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
8 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
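The effect of strip_na in the two outputs above can be imitated in plain Python. This is only an illustrative sketch of the observed behaviour, not the library's implementation: rows are modelled as dictionaries, and the function below merely shares its name with the real parameter. Note that the real method keeps the original index labels, as the 0, 2, 4, ... index above shows, whereas this sketch simply returns the surviving rows.

```python
NA = '~N/A~'  # placeholder shown for empty cells in the tables above


def strip_na(rows, columns):
    """Keep only rows in which none of the given columns holds the placeholder."""
    return [row for row in rows if all(row.get(col) != NA for col in columns)]


# Four rows reduced to two columns, copied from the first Phoible output
rows = [
    {'contribution_name': 'Korean (SPA 1)', 'tones': '0'},
    {'contribution_name': 'KOREAN (UPSID 423)', 'tones': NA},
    {'contribution_name': 'Ket (SPA 2)', 'tones': '0'},
    {'contribution_name': 'KET (UPSID 399)', 'tones': NA},
]

kept = strip_na(rows, ['tones'])
print([row['contribution_name'] for row in kept])
```

Passing an empty list, which mirrors the default, leaves all rows in place.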
Note: By default, calling get_df or get_json prints the citation. To disable this, set show_citation to False.
p = Phoible()
p.show_citation = False
p.get_df(strip_na=['tones']).head()
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
6 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
8 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
1.3. citation
You can get the citation for each database using the citation attribute. For example:
from lingtypology.datasets import Autotyp
print(Autotyp().citation)
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
Note: if you use Wals, the citation is shown for every feature. If you want a general citation for the whole of Wals, use general_citation.
w = Wals('1a', '2a')
print(w.citation)
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)
Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)
print(w.general_citation)
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013.
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info, Accessed on 2019-06-13.)
2. Wals
It is possible to access Wals data (online) using lingtypology.datasets.Wals.
from lingtypology.datasets import Wals
wals_page = Wals('1a', '2a').get_df()
wals_page.head()
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)
Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)
| | wals_code | language | genus | family | coordinates | _1A_area | _1A | _1A_num | _1A_desc | _2A_area | _2A | _2A_num | _2A_desc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | kiw | Kiwai (Southern) | Kiwaian | Kiwaian | (-8.0, 143.5) | Phonology | 1. Small | 1 | Small | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
1 | xoo | !Xóõ | Tu | Tu | (-24.0, 21.5) | Phonology | 5. Large | 5 | Large | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
2 | ani | //Ani | Khoe-Kwadi | Khoe-Kwadi | (-18.9166666667, 21.9166666667) | Phonology | 5. Large | 5 | Large | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
3 | abi | Abipón | South Guaicuruan | Guaicuruan | (-29.0, -61.0) | Phonology | 2. Moderately small | 2 | Moderately small | Phonology | 2. Average (5-6) | 2 | Average (5-6) |
4 | abk | Abkhaz | Northwest Caucasian | Northwest Caucasian | (43.0833333333, 41.0) | Phonology | 5. Large | 5 | Large | Phonology | 1. Small (2-4) | 1 | Small (2-4) |
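In the table above, each requested feature contributes four columns (for 1A: _1A_area, _1A, _1A_num and _1A_desc), and the combined value in _1A looks like the number and description joined by a period and a space. A minimal sketch of that naming and splitting pattern follows. It is inferred only from the rows shown here rather than from the library source, and feature_columns and split_value are hypothetical helpers, not lingtypology functions.

```python
def feature_columns(code):
    """Return the four column names observed for a feature code such as '1a'."""
    prefix = '_' + code.upper()
    return [prefix + '_area', prefix, prefix + '_num', prefix + '_desc']


def split_value(value):
    """Split a combined value such as '5. Large' into (5, 'Large')."""
    number, _, description = value.partition('. ')
    return int(number), description


print(feature_columns('1a'))
for combined in ('1. Small', '2. Moderately small', '2. Average (5-6)'):
    print(split_value(combined))
```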
Map example for feature 1A:
m = lingtypology.LingMap(wals_page.language)
m.add_custom_coordinates(wals_page.coordinates)
m.add_features(
wals_page._1A,
colors=lingtypology.gradient(5, 'yellow', 'green')
)
m.legend_title = 'Consonant Inventory'
m.create_map()
3. Autotyp
It is possible to access Autotyp data (online) using lingtypology.datasets.Autotyp.
Unlike in Wals, each table name passed into Autotyp adds several columns:
Autotyp_table = Autotyp('Gender', 'Agreement').get_df(strip_na=['Gender.binned4'])
Autotyp_table.head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
| | language | LID | Gender.n | Gender.binned4 | Gender.Presence | VPolyagreement.Presence.v2 | VPolyagreement.Presence.v1 |
---|---|---|---|---|---|---|---|
0 | Godoberi | 1531 | 3 | 3 genders | True | False | False |
1 | Bininj Kun-Wok | 655 | 4 | 4 genders | True | True | True |
2 | Luvale | 553 | 10 | more than 4 genders | True | True | False |
3 | North-Central Dargwa | 2949 | 3 | 3 genders | True | True | True |
4 | Gaagudju | 82 | 4 | 4 genders | True | True | True |
Now we can draw a map out of gender data from multiple languages.
m = lingtypology.LingMap(Autotyp_table.language)
m.add_features(
Autotyp_table['Gender.binned4'],
colors=lingtypology.gradient(4, color1='yellow', color2='red')
)
m.legend_title = 'Genders'
m.create_map()
4. AfBo
from lingtypology.datasets import AfBo
adj = AfBo('adjectivizer').get_df()
adj.head()
Seifart, Frank. 2013.
AfBo: A world-wide survey of affix borrowing.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://afbo.info, Accessed on 2019-06-13.)
| | language_recipient | language_donor | reliability | adjectivizer |
---|---|---|---|---|
0 | Resígaro | Bora | high | 0 |
1 | Gurindji Kriol | Gurindji | high | 0 |
2 | Copper Island Aleut | Russian | high | 0 |
3 | Sakha | Mongolian | high | 4 |
4 | Kalderash Romani | Romanian | high | 1 |
m = lingtypology.LingMap(adj.language_recipient)
m.add_features(adj['adjectivizer'], numeric=True)
m.legend_title = 'Adj'
m.create_map()
5. SAILS
from lingtypology.datasets import Sails
To get a pandas.DataFrame of features and descriptions:
Sails().features_descriptions.head()
| | Feature | Description |
---|---|---|
0 | ICU17 | Is plurality in independent pronouns expressed... |
1 | ICU16 | Is plurality in independent pronouns expressed... |
2 | ICU15 | Is plurality in independent pronouns expressed... |
3 | ICU14 | Is an associative or collective plural disting... |
4 | ICU13 | Are nouns denoting inanimates marked for plural? |
To get the descriptions for particular features:
Sails().feature_descriptions('ICU10', 'ICU11')
| | Feature | Description |
---|---|---|
0 | ICU10 | Is nominal plural marking obligatory? |
1 | ICU11 | Are nouns denoting humans marked for plural? |
To get the SAILS data as a dict, you can use the get_json method. To get the data as a pandas.DataFrame, you can run:
sails = Sails('ICU3', 'ICU4')
df = sails.get_df()
df.head()
You probably should cite it, but I don't understand how. Please, consult https://sails.clld.org/
| | language | coordinates | ICU3 | ICU3_desc | ICU4 | ICU4_desc |
---|---|---|---|---|---|---|
0 | Baniva | (5.26123, -67.56326999999999) | 1 | Yes | 0 | No |
1 | Apolista | (-14.83, -68.66) | 0 | No | ? | ? |
2 | Yavitero | (2.800281, -68.08421899999999) | 1 | Yes | 0 | No |
3 | Resígaro | (-2.48139, -71.35778) | 0 | No | 0 | No |
4 | Tol | (14.66859, -87.03719) | 0 | No | 0 | No |
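Note that the ICU4 column above holds '?' for Apolista, so unknown answers sit alongside the coded ones. Before counting or mapping, it may be useful to tally the answers and set the unknown ones aside. The sketch below does this in plain Python on the five rows shown above; it does not call lingtypology, and the variable names are illustrative only.

```python
from collections import Counter

# (language, ICU4, ICU4_desc) copied from the five rows shown above
rows = [
    ('Baniva', '0', 'No'),
    ('Apolista', '?', '?'),
    ('Yavitero', '0', 'No'),
    ('Resígaro', '0', 'No'),
    ('Tol', '0', 'No'),
]

# Tally the readable answers, including the unknown marker
counts = Counter(desc for _, _, desc in rows)

# Languages for which the answer is actually known
known = [language for language, code, _ in rows if code != '?']

print(counts)
print(known)
```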
Map example:
m = lingtypology.LingMap(df.language)
m.add_features(df.ICU3_desc)
m.legend_title = sails.feature_descriptions('ICU3').Description.at[0]
m.start_location = (9, -79)
m.start_zoom = 5
m.legend_position = 'bottomleft'
m.create_map()
6. Phoible
from lingtypology.datasets import Phoible
Unlike with the other databases, you do not pass features into Phoible; you pass a subset instead. Take a look:
p = Phoible()
p.get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
1 | KOREAN (UPSID 423) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 32 | 21 | 11 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2170.html | https://phoible.org/languages/kore1280 |
2 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
3 | KET (UPSID 399) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 25 | 18 | 7 | ~N/A~ | http://web.phonetik.uni-frankfurt.de/L/L2706.html | https://phoible.org/languages/kett1243 |
4 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
Some languages have several entries because Phoible data consists of several different subsets. You can get the list of available subsets:
p.subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']
… and pass them into the class:
p = Phoible(subset='SPA')
df = p.get_df(strip_na=['tones'])
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | contribution_name | language | coordinates | glottocode | macroarea | phonemes | consonants | vowels | tones | source | inventory_page |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Korean (SPA 1) | Korean | (37.5, 128.0) | kore1280 | Eurasia | 40 | 22 | 18 | 0 | https://archive.org/details/kor_SPA1979_phon | https://phoible.org/languages/kore1280 |
1 | Ket (SPA 2) | Ket | (63.7551, 87.5466) | kett1243 | Eurasia | 32 | 18 | 14 | 0 | https://archive.org/details/ket_SPA1979_phon | https://phoible.org/languages/kett1243 |
2 | Lak (SPA 3) | Lak | (42.1328, 47.0809) | lakk1252 | Eurasia | 69 | 60 | 9 | 0 | https://archive.org/details/lbe_SPA1979_phon | https://phoible.org/languages/lakk1252 |
3 | Kabardian (SPA 4) | Kabardian | (43.5082, 43.3918) | kaba1278 | Eurasia | 56 | 49 | 7 | 0 | https://archive.org/details/kbd_SPA1979_phon | https://phoible.org/languages/kaba1278 |
4 | Georgian (SPA 5) | Georgian | (41.850396999999994, 43.78613) | nucl1302 | Eurasia | 35 | 29 | 6 | 0 | https://archive.org/details/kat_SPA1979_phon | https://phoible.org/languages/nucl1302 |
You can also get non-aggregated data by setting aggregated to False when initializing the class.
Phoible(aggregated=False).get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | InventoryID | Glottocode | ISO6393 | LanguageName | SpecificDialect | GlyphID | Phoneme | Allophones | Marginal | SegmentClass | ... | retractedTongueRoot | advancedTongueRoot | periodicGlottalSource | epilaryngealSource | spreadGlottis | constrictedGlottis | fortis | raisedLarynxEjective | loweredLarynxImplosive | click |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | kore1280 | kor | Korean | ~N/A~ | 0061 | a | a | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
1 | 1 | kore1280 | kor | Korean | ~N/A~ | 0061+02D0 | aː | aː | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
2 | 1 | kore1280 | kor | Korean | ~N/A~ | 00E6 | æ | ɛ æ | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
3 | 1 | kore1280 | kor | Korean | ~N/A~ | 00E6+02D0 | æː | æː | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
4 | 1 | kore1280 | kor | Korean | ~N/A~ | 0065 | e | e | ~N/A~ | vowel | ... | - | - | + | - | - | - | 0 | - | - | 0 |
5 rows × 48 columns
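The non-aggregated table has one row per phoneme, whereas the aggregated table reports per-inventory counts (phonemes, consonants, vowels, tones). As a rough illustration of how the two views relate, the sketch below counts segments per inventory by SegmentClass. It is plain Python rather than the library's own code, and it runs on a handful of rows copied from the Phoible tables in this document, so the totals describe only this sample and not the full inventories.

```python
from collections import Counter, defaultdict

# (InventoryID, Phoneme, SegmentClass) copied from Phoible rows in this document
segments = [
    (1, 'a', 'vowel'),
    (1, 'aː', 'vowel'),
    (1, 'æ', 'vowel'),
    (1, 'æː', 'vowel'),
    (1, 'e', 'vowel'),
    (198, 'cʼ', 'consonant'),
]

# Count segments per inventory, broken down by segment class
per_inventory = defaultdict(Counter)
for inventory_id, _, segment_class in segments:
    per_inventory[inventory_id][segment_class] += 1

for inventory_id, counts in per_inventory.items():
    print(inventory_id, sum(counts.values()), dict(counts))
```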
Map example:
m = lingtypology.LingMap(df.language)
m.colormap_colors = ('white', 'red')
m.add_features(df.tones, numeric=True)
m.legend_title = 'Tones'
m.legend_position = 'bottomleft'
m.create_map()
Another example (slow because of the large amount of data):
df = Phoible(subset='UPSID', aggregated=False).get_df()
#Get all languages with ejectives
df = df[df.raisedLarynxEjective == '+']
#Remove duplicates
df = df.drop_duplicates(subset='Glottocode')
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
| | InventoryID | Glottocode | ISO6393 | LanguageName | SpecificDialect | GlyphID | Phoneme | Allophones | Marginal | SegmentClass | ... | retractedTongueRoot | advancedTongueRoot | periodicGlottalSource | epilaryngealSource | spreadGlottis | constrictedGlottis | fortis | raisedLarynxEjective | loweredLarynxImplosive | click |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7570 | 198 | afad1236 | aal | KOTOKO | ~N/A~ | 0063+02BC | cʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
7802 | 206 | ahte1237 | aht | AHTNA | ~N/A~ | 006B+02BC | kʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
7920 | 211 | qawa1238 | alc | QAWASQAR | ~N/A~ | 006B+02BC | kʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
8131 | 218 | hame1242 | amf | HAMER | ~N/A~ | 0071+02BC | qʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
8157 | 219 | amha1245 | amh | AMHARIC | ~N/A~ | 006B+02B7+02BC | kʷʼ | ~N/A~ | False | consonant | ... | 0 | 0 | - | - | - | + | - | + | - | - |
5 rows × 48 columns
m = lingtypology.LingMap(df.Glottocode, glottocode=True)
m.title = 'Languages with Ejectives'
m.tiles = 'Stamen Terrain'
m.radius = 5
m.opacity = 0.5
m.colors = ('blue',)
m.create_map()
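The filter-then-deduplicate step used in the ejectives example (keep rows whose raisedLarynxEjective is '+', then keep one row per Glottocode) can be spelled out without pandas. The sketch below is an illustration under stated assumptions: it keeps the first matching row per Glottocode, which is what drop_duplicates does by default, and ejective_languages is a hypothetical helper name. The last sample row is added for illustration and is not copied from the output above.

```python
def ejective_languages(rows):
    """Keep the first ejective row per Glottocode, preserving input order."""
    seen = set()
    result = []
    for glottocode, name, phoneme, ejective in rows:
        if ejective == '+' and glottocode not in seen:
            seen.add(glottocode)
            result.append((glottocode, name, phoneme))
    return result


# (Glottocode, LanguageName, Phoneme, raisedLarynxEjective)
rows = [
    ('kore1280', 'Korean', 'a', '-'),
    ('afad1236', 'KOTOKO', 'cʼ', '+'),
    ('amha1245', 'AMHARIC', 'kʷʼ', '+'),
    ('amha1245', 'AMHARIC', 'tʼ', '+'),  # hypothetical duplicate-language row
]

print(ejective_languages(rows))
```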