Databases API

APIs for several linguistic databases are available through lingtypology.datasets.

import lingtypology.datasets

1. General

Lingtypology attempts to provide a unified API for the supported language databases, so the classes in this module share some common attributes and methods. This section describes them and gives examples for Autotyp, Wals and Phoible.

from lingtypology.datasets import Autotyp, Wals, Phoible

1.1. features_list

You can get the list of available features from the database using this attribute.

Autotyp().features_list[:10]  # Truncated so the output does not take too much space
['Agreement',
 'Alienability',
 'Alignment',
 'Alignment_case_splits',
 'Alignment_per_language',
 'Clause_linkage',
 'Clause_word_order',
 'Clusivity',
 'GR_per_language',
 'Gender']

Note: Phoible has no features_list attribute because it has no features. Instead, it has a subsets_list attribute that lists the available subsets of Phoible data.

Phoible().subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']
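
Code that handles several databases generically therefore has to check which of the two attributes is present. The following is a minimal sketch that relies only on the two attribute names documented above. The helper name list_options and the two stand-in classes are hypothetical; they are used so that the example runs without network access.

```python
def list_options(dataset):
    """Return features_list if the object has one, otherwise subsets_list."""
    if hasattr(dataset, "features_list"):
        return list(dataset.features_list)
    return list(dataset.subsets_list)


# Hypothetical stand-ins that mimic only the attributes documented above.
# With lingtypology installed, you would pass Autotyp() or Phoible() instead.
class FakeAutotyp:
    features_list = ["Agreement", "Alienability", "Alignment"]


class FakePhoible:
    subsets_list = ["all", "UPSID", "SPA", "AA", "PH", "GM", "RA", "SAPHON"]


print(list_options(FakeAutotyp()))
print(list_options(FakePhoible()))
```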

1.2. get_df and get_json

These two methods access the database and return the data as a pandas.DataFrame (get_df) or a dict (get_json). Example of usage:

Autotyp('Agreement', 'Clusivity').get_df().head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
language LID VPolyagreement.Presence.v2 VPolyagreement.Presence.v1 InclExclAsPerson.Presence InclExclAny.Presence InclExclType InclExclAsMinAug.Presence
0 Ambulas 6 False False False False no i/e False
1 Abkhazian 7 True True False False no i/e False
2 Acehnese 9 True False False True plain i/e type False
3 Western Keres 10 True True False False no i/e False
4 Hokkaido Ainu 12 True True False True plain i/e type False

Note: for Phoible and Autotyp you can use the strip_na parameter (list, default: []) to drop rows that have an empty cell in the given columns. Compare the following.

Without strip_na (empty cells are replaced with '~N/A~'):

Phoible().get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 KOREAN (UPSID 423) Korean (37.5, 128.0) kore1280 Eurasia 32 21 11 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2170.html https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
3 KET (UPSID 399) Ket (63.7551, 87.5466) kett1243 Eurasia 25 18 7 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2706.html https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252

With the tones column passed to strip_na:

Phoible().get_df(strip_na=['tones']).head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
6 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
8 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302

Note: by default, get_df and get_json print the citation when called. To disable this, set show_citation to False.

p = Phoible()
p.show_citation = False
p.get_df(strip_na=['tones']).head()
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
6 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
8 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302
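
The effect of strip_na can be pictured as a row filter. The sketch below reproduces it on plain dictionaries built from the first rows of the output above. It is not lingtypology's implementation, and the function name strip_na_rows is hypothetical. It assumes that empty cells are encoded as '~N/A~', as in the output above, and that a row is dropped as soon as one of the listed columns is empty.

```python
NA = "~N/A~"  # placeholder shown for empty cells in the output above


def strip_na_rows(rows, columns):
    """Keep only the rows whose cells in the given columns are not empty."""
    return [row for row in rows if all(row[col] != NA for col in columns)]


# Two cells per row, copied from the first rows of the Phoible table above.
rows = [
    {"contribution_name": "Korean (SPA 1)", "tones": "0"},
    {"contribution_name": "KOREAN (UPSID 423)", "tones": NA},
    {"contribution_name": "Ket (SPA 2)", "tones": "0"},
    {"contribution_name": "KET (UPSID 399)", "tones": NA},
]

kept = strip_na_rows(rows, ["tones"])
print([row["contribution_name"] for row in kept])
```

With an empty column list, which mirrors the documented default, the sketch keeps every row.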

1.3. citation

You can get the citation for each database using the citation attribute. For example:

from lingtypology.datasets import Autotyp
print(Autotyp().citation)
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0

Note: for Wals, citation contains a separate citation for every feature. If you want a general citation for the whole of Wals, use general_citation.

w = Wals('1a', '2a')
print(w.citation)
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)

Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)

print(w.general_citation)
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013.
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info, Accessed on 2019-06-13.)
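
Since show_citation and citation are shared across the classes, one possible pattern is to switch off the printed citation and keep the citation text next to the data instead. This is a sketch under that assumption: the helper name fetch_with_citation and the stand-in class are hypothetical, and the stand-in mimics only the names documented above so that the example runs offline.

```python
def fetch_with_citation(dataset):
    """Return (data, citation) without printing the citation as a side effect."""
    dataset.show_citation = False
    return dataset.get_df(), dataset.citation


# Hypothetical stand-in exposing only show_citation, citation and get_df.
# With lingtypology installed, you would pass e.g. Autotyp('Gender') instead.
class FakeDataset:
    show_citation = True
    citation = "Example citation text."

    def get_df(self):
        if self.show_citation:
            print(self.citation)
        return [{"language": "Korean"}]


data, citation = fetch_with_citation(FakeDataset())
print(citation)
```

For Wals, general_citation may be the more convenient attribute to store, because citation holds one entry per feature.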

2. Wals

It is possible to access Wals data (online) using lingtypology.datasets.Wals.

from lingtypology.datasets import Wals
wals_page = Wals('1a', '2a').get_df()
wals_page.head()
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-06-13.)

Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-06-13.)
wals_code language genus family coordinates _1A_area _1A _1A_num _1A_desc _2A_area _2A _2A_num _2A_desc
0 kiw Kiwai (Southern) Kiwaian Kiwaian (-8.0, 143.5) Phonology 1. Small 1 Small Phonology 2. Average (5-6) 2 Average (5-6)
1 xoo !Xóõ Tu Tu (-24.0, 21.5) Phonology 5. Large 5 Large Phonology 2. Average (5-6) 2 Average (5-6)
2 ani //Ani Khoe-Kwadi Khoe-Kwadi (-18.9166666667, 21.9166666667) Phonology 5. Large 5 Large Phonology 2. Average (5-6) 2 Average (5-6)
3 abi Abipón South Guaicuruan Guaicuruan (-29.0, -61.0) Phonology 2. Moderately small 2 Moderately small Phonology 2. Average (5-6) 2 Average (5-6)
4 abk Abkhaz Northwest Caucasian Northwest Caucasian (43.0833333333, 41.0) Phonology 5. Large 5 Large Phonology 1. Small (2-4) 1 Small (2-4)
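
In the output above, each feature contributes four columns. Judging from the rows shown, _1A holds the full value label ('1. Small'), while _1A_num and _1A_desc hold its numeric and descriptive parts. The sketch below illustrates that relationship on labels copied from the table. The helper name split_value is hypothetical, and the relationship is inferred from the displayed rows rather than from how lingtypology builds the columns.

```python
def split_value(label):
    """Split a label such as '5. Large' into (5, 'Large')."""
    number, description = label.split(". ", 1)
    return int(number), description


for label in ["1. Small", "2. Moderately small", "2. Average (5-6)"]:
    print(split_value(label))
```

The numeric part is convenient when the values need to be ordered, for instance before assigning a colour gradient to them.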

Map example for feature 1A:

m = lingtypology.LingMap(wals_page.language)
m.add_custom_coordinates(wals_page.coordinates)
m.add_features(
    wals_page._1A,
    colors=lingtypology.gradient(5, 'yellow', 'green')
)
m.legend_title = 'Consonant Inventory'
m.create_map()

3. Autotyp

It is possible to access Autotyp data (online) using lingtypology.datasets.Autotyp.

Unlike in Wals, each table name passed to Autotyp adds several columns to the result:

Autotyp_table = Autotyp('Gender', 'Agreement').get_df(strip_na=['Gender.binned4'])
Autotyp_table.head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
language LID Gender.n Gender.binned4 Gender.Presence VPolyagreement.Presence.v2 VPolyagreement.Presence.v1
0 Godoberi 1531 3 3 genders True False False
1 Bininj Kun-Wok 655 4 4 genders True True True
2 Luvale 553 10 more than 4 genders True True False
3 North-Central Dargwa 2949 3 3 genders True True True
4 Gaagudju 82 4 4 genders True True True

Now we can draw a map of the gender data for these languages.

m = lingtypology.LingMap(Autotyp_table.language)
m.add_features(
    Autotyp_table['Gender.binned4'],
    colors=lingtypology.gradient(4, color1='yellow', color2='red')
)
m.legend_title = 'Genders'
m.create_map()

4. AfBo

from lingtypology.datasets import AfBo
adj = AfBo('adjectivizer').get_df()
adj.head()
Seifart, Frank. 2013.
AfBo: A world-wide survey of affix borrowing.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://afbo.info, Accessed on 2019-06-13.)
language_recipient language_donor reliability adjectivizer
0 Resígaro Bora high 0
1 Gurindji Kriol Gurindji high 0
2 Copper Island Aleut Russian high 0
3 Sakha Mongolian high 4
4 Kalderash Romani Romanian high 1

m = lingtypology.LingMap(adj.language_recipient)
m.add_features(adj['adjectivizer'], numeric=True)
m.legend_title = 'Adj'
m.create_map()

5. SAILS

from lingtypology.datasets import Sails

To get a pandas.DataFrame of features and descriptions:

Sails().features_descriptions.head()
Feature Description
0 ICU17 Is plurality in independent pronouns expressed...
1 ICU16 Is plurality in independent pronouns expressed...
2 ICU15 Is plurality in independent pronouns expressed...
3 ICU14 Is an associative or collective plural disting...
4 ICU13 Are nouns denoting inanimates marked for plural?

To get descriptions for particular features:

Sails().feature_descriptions('ICU10', 'ICU11')
Feature Description
0 ICU10 Is nominal plural marking obligatory?
1 ICU11 Are nouns denoting humans marked for plural?

To get the SAILS data as a dict, use the get_json method. To get the data as a pandas.DataFrame, run:

sails = Sails('ICU3', 'ICU4')
df = sails.get_df()
df.head()
You probably should cite it, but I don't understand how. Please, consult https://sails.clld.org/
language coordinates ICU3 ICU3_desc ICU4 ICU4_desc
0 Baniva (5.26123, -67.56326999999999) 1 Yes 0 No
1 Apolista (-14.83, -68.66) 0 No ? ?
2 Yavitero (2.800281, -68.08421899999999) 1 Yes 0 No
3 Resígaro (-2.48139, -71.35778) 0 No 0 No
4 Tol (14.66859, -87.03719) 0 No 0 No

Map example:

m = lingtypology.LingMap(df.language)
m.add_features(df.ICU3_desc)
m.legend_title = sails.feature_descriptions('ICU3').Description.at[0]
m.start_location = (9, -79)
m.start_zoom = 5
m.legend_position = 'bottomleft'
m.create_map()

6. Phoible

from lingtypology.datasets import Phoible

Unlike the other databases, Phoible does not take features. You pass it a subset instead. Take a look:

p = Phoible()
p.get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 KOREAN (UPSID 423) Korean (37.5, 128.0) kore1280 Eurasia 32 21 11 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2170.html https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
3 KET (UPSID 399) Ket (63.7551, 87.5466) kett1243 Eurasia 25 18 7 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2706.html https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252

Some languages have several entries because Phoible data consists of several different subsets. You can get the list of available subsets:

p.subsets_list
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']

… and pass them into the class:

p = Phoible(subset='SPA')
df = p.get_df(strip_na=['tones'])
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
2 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
3 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
4 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302
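
To see why a language can appear more than once in the unfiltered table, it helps to group the entries by glottocode. The sketch below does this in plain Python on (contribution_name, glottocode) pairs copied from the Phoible().get_df() output above. It only illustrates the structure of the data and makes no claim about how lingtypology selects a subset.

```python
from collections import defaultdict

# (contribution_name, glottocode) pairs copied from the unfiltered output above.
entries = [
    ("Korean (SPA 1)", "kore1280"),
    ("KOREAN (UPSID 423)", "kore1280"),
    ("Ket (SPA 2)", "kett1243"),
    ("KET (UPSID 399)", "kett1243"),
    ("Lak (SPA 3)", "lakk1252"),
]

# Collect every contribution that describes the same language.
by_glottocode = defaultdict(list)
for name, glottocode in entries:
    by_glottocode[glottocode].append(name)

for glottocode, names in by_glottocode.items():
    print(glottocode, names)
```

Korean and Ket each have one SPA and one UPSID contribution among these rows, which is why they appear twice until a single subset is chosen.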

You can also get non-aggregated data by setting aggregated to False while initializing the class.

Phoible(aggregated=False).get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
InventoryID Glottocode ISO6393 LanguageName SpecificDialect GlyphID Phoneme Allophones Marginal SegmentClass ... retractedTongueRoot advancedTongueRoot periodicGlottalSource epilaryngealSource spreadGlottis constrictedGlottis fortis raisedLarynxEjective loweredLarynxImplosive click
0 1 kore1280 kor Korean ~N/A~ 0061 a a ~N/A~ vowel ... - - + - - - 0 - - 0
1 1 kore1280 kor Korean ~N/A~ 0061+02D0 ~N/A~ vowel ... - - + - - - 0 - - 0
2 1 kore1280 kor Korean ~N/A~ 00E6 æ ɛ æ ~N/A~ vowel ... - - + - - - 0 - - 0
3 1 kore1280 kor Korean ~N/A~ 00E6+02D0 æː æː ~N/A~ vowel ... - - + - - - 0 - - 0
4 1 kore1280 kor Korean ~N/A~ 0065 e e ~N/A~ vowel ... - - + - - - 0 - - 0

5 rows × 48 columns

Map example:

m = lingtypology.LingMap(df.language)
m.colormap_colors = ('white', 'red')
m.add_features(df.tones, numeric=True)
m.legend_title = 'Tones'
m.legend_position = 'bottomleft'
m.create_map()

Another example (slow because of the large amount of data):

df = Phoible(subset='UPSID', aggregated=False).get_df()
# Keep only the rows for ejective segments
df = df[df.raisedLarynxEjective == '+']
# Keep one row per language (first row for each Glottocode)
df = df.drop_duplicates(subset='Glottocode')
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-06-13.)
InventoryID Glottocode ISO6393 LanguageName SpecificDialect GlyphID Phoneme Allophones Marginal SegmentClass ... retractedTongueRoot advancedTongueRoot periodicGlottalSource epilaryngealSource spreadGlottis constrictedGlottis fortis raisedLarynxEjective loweredLarynxImplosive click
7570 198 afad1236 aal KOTOKO ~N/A~ 0063+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
7802 206 ahte1237 aht AHTNA ~N/A~ 006B+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
7920 211 qawa1238 alc QAWASQAR ~N/A~ 006B+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
8131 218 hame1242 amf HAMER ~N/A~ 0071+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
8157 219 amha1245 amh AMHARIC ~N/A~ 006B+02B7+02BC kʷʼ ~N/A~ False consonant ... 0 0 - - - + - + - -

5 rows × 48 columns

m = lingtypology.LingMap(df.Glottocode, glottocode=True)
m.title = 'Languages with Ejectives'
m.tiles = 'Stamen Terrain'
m.radius = 5
m.opacity = 0.5
m.colors = ('blue',)
m.create_map()
