Overview

This notebook gives a general overview of the features included in the dataset.

Notebook settings
-----------------

CORPUS_PATH: '/home/runner/work/workflow_deployment/distant_listening_corpus'
ANNOTATED_ONLY: True

Data and software versions
--------------------------

Data repo 'distant_listening_corpus' @ e1afefe
dimcat version 0.3.0
ms3 version 2.5.2

dataset = dc.Dataset()
dataset.load(directory=CORPUS_PATH, parse_tsv=False)

---------------------------------------------------------------------------
DeprecationWarning                        Traceback (most recent call last)
Cell In[5], line 4
      2     annotated_view = dataset.data.get_view('annotated')
      3     annotated_view.include('facets', 'measures', 'notes$', 'expanded')
----> 4     annotated_view.fnames_with_incomplete_facets = False
      5     dataset.data.set_view(annotated_view)
      6 dataset.data.parse_tsv(choose='auto')

File ~/.local/lib/python3.10/site-packages/ms3/view.py:124, in View.fnames_with_incomplete_facets(self, value)
    122 @fnames_with_incomplete_facets.setter
    123 def fnames_with_incomplete_facets(self, value):
--> 124     raise DeprecationWarning(
    125         "'fnames_with_incomplete_facets' was renamed to  'pieces_with_incomplete_facets' in "
    126         "ms3 v2."
    127     )

DeprecationWarning: 'fnames_with_incomplete_facets' was renamed to  'pieces_with_incomplete_facets' in ms3 v2.
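The error message itself names the fix: in ms3 v2 the setter `fnames_with_incomplete_facets` raises, and `pieces_with_incomplete_facets` is the replacement. The following is a minimal sketch of an assignment that tolerates either name. The helper `set_incomplete_facets_flag` and the `FakeView` stand-in are hypothetical, introduced only so the snippet runs without ms3 installed; `FakeView` mimics the behaviour shown in the traceback and is not the real `ms3.View`.

```python
def set_incomplete_facets_flag(view, value):
    """Set the 'incomplete facets' flag under whichever name the view accepts.

    Hypothetical helper: prefers the ms3 v2 name and falls back to the
    pre-v2 name only if the new attribute does not exist on the object.
    """
    if hasattr(view, "pieces_with_incomplete_facets"):
        view.pieces_with_incomplete_facets = value
        return "pieces_with_incomplete_facets"
    view.fnames_with_incomplete_facets = value
    return "fnames_with_incomplete_facets"


class FakeView:
    """Stand-in reproducing the setter behaviour from the traceback (assumption)."""

    def __init__(self):
        self._flag = True

    @property
    def pieces_with_incomplete_facets(self):
        return self._flag

    @pieces_with_incomplete_facets.setter
    def pieces_with_incomplete_facets(self, value):
        self._flag = bool(value)

    @property
    def fnames_with_incomplete_facets(self):
        return self._flag

    @fnames_with_incomplete_facets.setter
    def fnames_with_incomplete_facets(self, value):
        # Same message as in the traceback above.
        raise DeprecationWarning(
            "'fnames_with_incomplete_facets' was renamed to "
            "'pieces_with_incomplete_facets' in ms3 v2."
        )


view = FakeView()
used_name = set_incomplete_facets_flag(view, False)
print(used_name, view.pieces_with_incomplete_facets)
```

In the failing cell, the corresponding one-line change would presumably be `annotated_view.pieces_with_incomplete_facets = False`, with the rest of the cell left as it is.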

Composition dates

This section relies on the dataset’s metadata.

valid_composed_start = pd.to_numeric(all_metadata.composed_start, errors='coerce')
valid_composed_end = pd.to_numeric(all_metadata.composed_end, errors='coerce')
print(f"Composition dates range from {int(valid_composed_start.min())} {valid_composed_start.idxmin()} "
      f"to {int(valid_composed_end.max())} {valid_composed_end.idxmax()}.")
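To illustrate what `errors='coerce'` does in the cell above, here is a small self-contained sketch on invented metadata. The `toy_metadata` table, its values, and its `(corpus, piece)` index are assumptions made for the example and do not come from the corpus. Entries that cannot be parsed as numbers become `NaN`, and `min()`, `max()`, `idxmin()` and `idxmax()` skip them.

```python
import pandas as pd

# Invented metadata indexed by (corpus, piece); ".." and None stand in for
# unparseable or missing composition years.
toy_metadata = pd.DataFrame(
    {
        "composed_start": ["1689", "1802", "..", None],
        "composed_end": ["1689", "1803", "1911", None],
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("corpus_a", "piece_1"),
            ("corpus_a", "piece_2"),
            ("corpus_b", "piece_1"),
            ("corpus_b", "piece_2"),
        ],
        names=["corpus", "piece"],
    ),
)

# Non-numeric entries are coerced to NaN instead of raising an error.
starts = pd.to_numeric(toy_metadata.composed_start, errors="coerce")
ends = pd.to_numeric(toy_metadata.composed_end, errors="coerce")

# idxmin/idxmax return the (corpus, piece) tuple of the extreme value.
earliest_year, earliest_piece = int(starts.min()), starts.idxmin()
latest_year, latest_piece = int(ends.max()), ends.idxmax()
print(
    f"Composition dates range from {earliest_year} {earliest_piece} "
    f"to {latest_year} {latest_piece}."
)
```

Note that `int(...)` would fail if every value in a column were coerced to `NaN`, so a guard may be worthwhile for corpora without dated metadata.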

Mean composition years per corpus

Composition years histogram

Dimensions

Overview

Measures

print(f"{len(all_measures.index)} measures over {len(all_measures.groupby(level=[0,1]))} files.")
all_measures.head()

print("Distribution of time signatures per XML measure (MC):")
all_measures.timesig.value_counts(dropna=False)
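The two cells above count files by grouping on the first two index levels and tabulate time signatures including missing values. A minimal sketch on invented data shows both steps; the `toy_measures` table and its three-level index `(corpus, piece, i)` are assumptions for the example, since the source only shows that the first two levels identify a file.

```python
import pandas as pd

# Invented measure table: one row per XML measure (MC).
toy_measures = pd.DataFrame(
    {
        "mc": [1, 2, 3, 1, 2, 1],
        "timesig": ["4/4", "4/4", "3/4", "3/4", None, "4/4"],
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("corpus_a", "piece_1", 0),
            ("corpus_a", "piece_1", 1),
            ("corpus_a", "piece_1", 2),
            ("corpus_a", "piece_2", 0),
            ("corpus_a", "piece_2", 1),
            ("corpus_b", "piece_1", 0),
        ],
        names=["corpus", "piece", "i"],
    ),
)

# Each distinct (corpus, piece) pair is one file.
n_files = toy_measures.groupby(level=[0, 1]).ngroups

# dropna=False keeps measures without a time signature in the tally.
timesig_counts = toy_measures.timesig.value_counts(dropna=False)

print(f"{len(toy_measures.index)} measures over {n_files} files.")
print(timesig_counts)
```

Using `.ngroups` gives the same number as `len(...)` on the groupby object, as in the original cell.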

Harmony labels

This section counts all chord symbols regardless of the local key, even though the mode of the local key changes what a given symbol means.

try:
    all_annotations = dataset.get_facet('expanded')
except Exception:
    all_annotations = pd.DataFrame()
n_annotations = len(all_annotations.index)
includes_annotations = n_annotations > 0
if includes_annotations:
    display(all_annotations.head())
    print(f"Concatenated annotation table contains {all_annotations.shape[0]} rows.")
    no_chord = all_annotations.root.isna()
    if no_chord.sum() > 0:
        print(f"{no_chord.sum()} of them are not chords. Their values are: {all_annotations.label[no_chord].value_counts(dropna=False).to_dict()}")
    all_chords = all_annotations[~no_chord].copy()
    print(f"Dataset contains {all_chords.shape[0]} tokens and {len(all_chords.chord.unique())} types over {len(all_chords.groupby(level=[0,1]))} documents.")
    all_annotations['corpus_name'] = all_annotations.index.get_level_values(0).map(get_corpus_display_name)
    all_chords['corpus_name'] = all_chords.index.get_level_values(0).map(get_corpus_display_name)
else:
    print("Dataset contains no annotations.")
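The cell above separates chord labels from non-chord labels through a missing `root` value and then reports tokens (rows) versus types (distinct `chord` values). The sketch below replays that logic on invented data; the `toy_annotations` table, its values, and its index are assumptions for the example and do not reflect the corpus.

```python
import pandas as pd

# Invented annotation table; the row without a root is not a chord.
toy_annotations = pd.DataFrame(
    {
        "label": ["I", "V7", "I", "@none", "V7", "IV"],
        "chord": ["I", "V7", "I", None, "V7", "IV"],
        "root": [0, 1, 0, None, 1, -1],
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("corpus_a", "piece_1", 0),
            ("corpus_a", "piece_1", 1),
            ("corpus_a", "piece_1", 2),
            ("corpus_a", "piece_2", 0),
            ("corpus_a", "piece_2", 1),
            ("corpus_b", "piece_1", 0),
        ],
        names=["corpus", "piece", "i"],
    ),
)

# Rows with a missing root are treated as non-chord labels.
no_chord = toy_annotations.root.isna()
non_chord_labels = toy_annotations.label[no_chord].value_counts(dropna=False).to_dict()

toy_chords = toy_annotations[~no_chord].copy()
n_tokens = toy_chords.shape[0]                         # every chord occurrence
n_types = toy_chords.chord.nunique()                   # distinct chord symbols
n_documents = toy_chords.groupby(level=[0, 1]).ngroups  # distinct (corpus, piece) pairs

print(f"{int(no_chord.sum())} labels are not chords: {non_chord_labels}")
print(f"{n_tokens} tokens and {n_types} types over {n_documents} documents.")
```

Here `nunique()` ignores missing values, whereas `len(chord.unique())` as used in the original cell would count `NaN` as one extra type if any remained after filtering.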