lingtypology.datasets

Intro

One of the objectives of LingTypology is to provide a simple interface for linguistic databases. Therefore, the classes used for accessing them have a unified API: most attributes and methods are shared by all of them. In the following two sections I will describe this universal interface.

Universal Attributes

  • show_citation (bool, default True)
    Whether to print the citation when get_df method is called.
  • citation (str)
    Citation for the database.
  • features_list or subsets_list (list of str)
    List of available features for all the databases except for Phoible. In the case of Phoible, it is the list of available subsets (UPSID, SPA etc.).
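
The attributes above can be read the same way on any of the dataset objects. The following is a minimal sketch of that idea, not part of the library: the helper names available_items and silence_citation are hypothetical, and the objects are plain stand-ins built with SimpleNamespace so that nothing is downloaded. It also assumes that show_citation can be reassigned after the object is created.

```python
from types import SimpleNamespace


def available_items(db):
    """Return features_list, or subsets_list for Phoible-like objects."""
    if hasattr(db, "features_list"):
        return list(db.features_list)
    return list(db.subsets_list)


def silence_citation(db):
    """Switch off citation printing and hand back the citation string."""
    db.show_citation = False  # assumption: the attribute is writable
    return db.citation


# Stand-in objects (not real lingtypology classes) with placeholder values.
wals_like = SimpleNamespace(
    show_citation=True, citation="(citation text)", features_list=["1A", "2A"]
)
phoible_like = SimpleNamespace(
    show_citation=True, citation="(citation text)", subsets_list=["UPSID", "SPA"]
)

print(available_items(wals_like))
print(available_items(phoible_like))
```

With the library installed, the same helpers would presumably accept an instance of any class described below.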

Universal Methods

  • get_df

    In all cases, the parameters are optional. They depend on the particular class.

    In the case of Wals, it has an optional str parameter join_how: the way multiple WALS pages will be joined (either inner or outer). If the value is inner, the resulting table will only contain data for languages mentioned in all the given pages. Otherwise, the resulting table will contain data for languages mentioned in at least one of the pages. Default: inner.

    In the case of Autotyp and Phoible, it has an optional list parameter strip_na. It is a list of columns. If this parameter is given, the rows where some values in the given columns are missing will be dropped. Default: [].

    Returns the dataset as pandas.DataFrame.

  • get_json

    It works the same way as get_df, but it returns a dict object whose keys are the headers of the table.
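
To make the shape of the get_json result concrete, here is a small sketch that turns a header-keyed dictionary back into row records. It assumes that every key maps to a list of column values of equal length; the text above only states that the keys are the table headers, so the value layout is an assumption. The helper name columns_to_rows and the sample values are invented for illustration.

```python
def columns_to_rows(table):
    """Convert {header: [values, ...]} into a list of {header: value} rows."""
    headers = list(table)
    columns = [table[header] for header in headers]
    # zip(*columns) walks the columns in parallel, one row at a time.
    return [dict(zip(headers, values)) for values in zip(*columns)]


# Made-up data in the assumed column-oriented layout.
sample = {
    "language": ["Language A", "Language B"],
    "_1A": ["value 1", "value 2"],
}

rows = columns_to_rows(sample)
print(rows[0])
```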

Classes

class lingtypology.datasets.Wals(*features)

Bases: object

WALS database.

WALS: ‘The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors.’ (Dryer and Haspelmath 2013). The data from WALS is retrieved from multiple web pages that contain the data for each chapter when the get_df method is called.

Parameters:*features (list of str) – List of WALS pages that will be present in the resulting table. E.g. ['1A'].
general_citation

The general citation for all the WALS pages.

Type:str
show_citation

Whether to print the citation for the given features when get_df method is called.

Type:bool
features_list

List of all the WALS pages.

Type:list
citation

Citation for the given WALS pages.

Type:str
get_df(join_how='inner')

Get data from WALS in pandas.DataFrame format.

Returns:DataFrame. Headers: ‘wals code’, ‘language’, ‘genus’, ‘family’, ‘area’, ‘coordinates’, [[name of the page1]], [[name of the page2]], … Names of the pages start with ‘_’.
Return type:pandas.DataFrame
get_json(join_how='inner')

Get data from Wals in JSON format.

Returns:Dictionary. Keys: ‘wals code’, ‘language’, ‘genus’, ‘family’, ‘area’, ‘coordinates’, [[name of the page1]], [[name of the page2]], … Names of the pages start with ‘_’.
Return type:dict
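
The difference between the two join_how values can be illustrated without touching the network. The sketch below is a plain-Python illustration of inner versus outer joining, not the library's implementation: each page is modelled as a dictionary from a language code to a value, and the function name join_pages, the codes and the values are all made up.

```python
def join_pages(pages, join_how="inner"):
    """Join per-page {code: value} dicts on their language codes."""
    if not pages:
        return {}
    code_sets = [set(page) for page in pages]
    if join_how == "inner":
        codes = set.intersection(*code_sets)  # languages present in every page
    elif join_how == "outer":
        codes = set.union(*code_sets)  # languages present in at least one page
    else:
        raise ValueError("join_how must be 'inner' or 'outer'")
    # Missing values in an outer join are filled with None.
    return {code: [page.get(code) for page in pages] for code in sorted(codes)}


# Toy pages with invented language codes.
page_one = {"aaa": "v1", "bbb": "v2"}
page_two = {"bbb": "v3", "ccc": "v4"}

inner = join_pages([page_one, page_two], join_how="inner")
outer = join_pages([page_one, page_two], join_how="outer")
print(inner)
print(outer)
```

With inner, only the language shared by both toy pages survives; with outer, all three appear and the gaps are filled with None.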
class lingtypology.datasets.Autotyp(*tables)

Bases: object

Autotyp database.

Autotyp is a database that consists of multiple modules. Each module represents a grammatical feature (e.g. Agreement) and contains information on this feature for various languages (Bickel et al. 2017). The data is downloaded when the get_df method is called.

Parameters:*tables (list of str) – List of the Autotyp tables that will be merged in the resulting table. E.g. ['gender'].
show_citation

Whether to print the citation when get_df method is called.

Type:bool
citation

Citation for the Autotyp database.

Type:str
features_list

List of available Autotyp tables.

Type:list
get_df(strip_na=None)

Get data from Autotyp in pandas.DataFrame format.

Returns:DataFrame. Headers: ‘Language’, ‘LID’, [[features columns]]
Return type:pandas.DataFrame
get_json(strip_na=None)

Get data from Autotyp in JSON format.

Returns:Dictionary. Keys: ‘Language’, ‘LID’, [[features columns]]
Return type:dict
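
The effect of strip_na, as described in the Universal Methods section, can be sketched in plain Python. This is an illustration of the row-dropping rule only, under the assumption that a missing value is represented by None; the function name strip_rows, the column name feature_x and the rows are hypothetical.

```python
def strip_rows(rows, strip_na=None):
    """Drop rows that lack a value in any of the columns named in strip_na."""
    columns = strip_na or []  # no columns given: keep every row
    return [
        row for row in rows
        if all(row.get(column) is not None for column in columns)
    ]


# Made-up rows; None stands for a missing value.
rows = [
    {"Language": "Language A", "LID": 1, "feature_x": "yes"},
    {"Language": "Language B", "LID": 2, "feature_x": None},
]

kept = strip_rows(rows, strip_na=["feature_x"])
print(kept)
```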
class lingtypology.datasets.AfBo(*features)

Bases: object

AfBo database of borrowed affixes.

AfBo: A world-wide survey of affix borrowing (Seifart 2013). AfBo contains information about borrowed affixes in different languages. It provides data as a ZIP archive with CSV files. The data is downloaded when the class is initialized.

Parameters:
  • *features (list of str) – List of AfBo features that will be present in the resulting table. E.g. ['adjectivizer'].
  • show_citation (bool, default True) – Whether to print the citation when get_df method is called.
  • citation – Citation for AfBo.
  • features_list (list) – List of available features from AfBo.
get_df()

Get data from AfBo in pandas.DataFrame format.

Returns:DataFrame. Headers: ‘Recipient_name’, ‘Donor_name’, [[feature1]], [[feature2]], …
Return type:pandas.DataFrame
get_json()

Get data from AfBo in JSON format.

Returns:Dictionary. Keys: ‘Recipient_name’, ‘Donor_name’, [[feature1]], [[feature2]], …
Return type:dict
class lingtypology.datasets.Sails(*features)

Bases: object

SAILS dataset.

‘The South American Indigenous Language Structures (SAILS) is a large database of grammatical properties of languages gathered from descriptive materials (such as reference grammars)’ (Muysken et al. 2016). As in the case of AfBo, SAILS data is available as a ZIP archive. The data is downloaded when the class is initialized.

Parameters:*features (list of str) – List of SAILS pages that will be included in the resulting table.
show_citation

Whether to print the citation when get_df method is called.

Type:bool, default True
citation

Citation for SAILS.

features_list

List of available features from SAILS.

Type:list
features_descriptions

Table that contains descriptions of all the SAILS pages.

Type:pandas.DataFrame
feature_descriptions(*features)

Get the description for particular features.

Parameters:*features (list) – Features from SAILS.
Returns:
Return type:pandas.DataFrame
get_df()

Get data from SAILS in pandas.DataFrame format.

Returns:DataFrame. Headers: ‘Language’, ‘Coordinates’, [[feature 1]], [[feature 1 human_readable]], [[feature 2]], …
Return type:pandas.DataFrame
get_json()

Get data from SAILS in JSON format.

Returns:Dictionary. Keys: ‘Language’, ‘Coordinates’, [[feature 1]], [[feature 1 human_readable]], [[feature 2]], …
Return type:dict
class lingtypology.datasets.Phoible(subset='all', aggregated=True)

Bases: object

PHOIBLE phonological database.

‘PHOIBLE is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample.’ (Moran and McCloy 2019). Unlike other databases supported by LingTypology, PHOIBLE is not a unified dataset. It contains data from the following datasets:

  • SAPHON: South American Phonological Inventory Database (Lev, Stark, and Chang 2012).
  • AA: Alphabets of Africa (Chanard 2006).
  • GM: ‘Christopher Green and Steven Moran extracted phonological inventories from secondary sources including grammars and phonological descriptions with the goal of attaining pan-Africa coverage’ (Moran, McCloy, and Wright 2014).
  • PH: ‘Christopher Green and Steven Moran extracted phonological inventories from secondary sources including grammars and phonological descriptions with the goal of attaining pan-Africa coverage’ (Moran, McCloy, and Wright 2014).
  • RA: Common Linguistic Features in Indian Languages: Phonetics (Ramaswami 1999).
  • SPA: Stanford Phonology Archive (Crothers et al. 1979).
  • UPSID: UCLA Phonological Segment Inventory Database (Maddieson and Precoda 1990).
Parameters:subset (str, default 'all') – One of the PHOIBLE datasets or all of them.
show_citation

Whether to print the citation when get_df method is called.

Type:bool, default True
citation

Citation for PHOIBLE.

subsets_list

List of available subsets of PHOIBLE.

Type:list
get_df(strip_na=None)

Get data from PHOIBLE in pandas.DataFrame format.

Returns:DataFrame. Headers: ‘contribution_name’, ‘language’, ‘coordinates’, ‘glottocode’, ‘macroarea’, ‘consonants’, ‘vowels’, ‘source’, ‘inventory_page’
Return type:pandas.DataFrame
get_json(strip_na=None)

Get data from PHOIBLE in JSON format.

Returns:Dictionary. Keys: ‘contribution_name’, ‘language’, ‘coordinates’, ‘glottocode’, ‘macroarea’, ‘consonants’, ‘vowels’, ‘source’, ‘inventory_page’
Return type:dict
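
Because all the classes above share the constructor-then-get_df pattern, one loader function can serve any of them. The sketch below shows that pattern under stated assumptions: load_table is a hypothetical helper, FakeDataset is a stand-in that mimics the documented interface so the example runs offline, and show_citation is assumed to be writable after construction.

```python
def load_table(dataset_cls, *args, quiet=True, **get_df_kwargs):
    """Instantiate a dataset class and return the result of get_df."""
    db = dataset_cls(*args)
    if quiet:
        db.show_citation = False  # assumption: can be set after __init__
    return db.get_df(**get_df_kwargs)


class FakeDataset:
    """Stand-in with the documented interface; it returns a dict, not a DataFrame."""

    citation = "(citation text)"

    def __init__(self, *features):
        self.features = features
        self.show_citation = True

    def get_df(self, strip_na=None):
        if self.show_citation:
            print(self.citation)
        return {"features": list(self.features), "strip_na": strip_na}


result = load_table(FakeDataset, "feature_a", strip_na=["feature_a"])
print(result)
```

With the library installed, one of the classes from lingtypology.datasets would be passed in place of FakeDataset, together with only the keyword arguments that class accepts (join_how for Wals, strip_na for Autotyp and Phoible).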