How do I download data with a Python script? : Alveo Support

Currently you can download an item list from the web application as a zip file. This has two drawbacks: first, it grabs all of the files associated with each item; second, the zip file may take a long time to build before the download can start.

For Austalk data in particular, each item has many associated files: one for each recorded audio channel, and possibly a TextGrid annotation file. Most likely you only want the TextGrid file and the speaker headset audio data.

With zip download, the server must first build a zip file containing the selected files and only then let you download it. This is fine for small datasets, but it becomes slow as the data gets large.

In a future release we'll address these problems directly. We'll offer a way to select which files are included in a download, and we'll provide an alternative download method using Aspera, which is much faster and avoids the need to create a zip file.

In the meantime, an alternative is to write a script that uses the API to download the data. This lets you be selective about what you download, and because the download happens file by file, the script does not need to wait for the server to do any work before it gets started.

The attached file is a sample script to download data from an Item List, particularly written to deal with Austalk data (it could be modified for other collections). The comments in the file give some instructions on how to use it.

The main work in the script is done by the following code:

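import os
import pyalveo

# Assumed to be set earlier in the script: itemlist_url (the URL of your
# Item List) and outputdir (a hypothetical name for the local target directory).
client = pyalveo.Client()

itemlist = client.get_item_list(itemlist_url)

for itemurl in itemlist:
    item = client.get_item(itemurl)
    meta = item.metadata()
    speakerid = meta['alveo:metadata']['olac:speaker']

    # Keep each speaker's files in their own directory
    subdir = os.path.join(outputdir, speakerid)
    os.makedirs(subdir, exist_ok=True)

    for doc in item.get_documents():
        filename = doc.get_filename()
        # Only the headset audio and the TextGrid annotation are wanted
        if filename.endswith(('speaker16.wav', '.TextGrid')):
            doc.download_content(dir_path=subdir)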

The script first connects to Alveo by creating a client object. It then fetches the target item list and, for each item, downloads the documents we're interested in. The script looks at the item metadata to find the speaker identifier and stores each speaker's files in a separate directory.
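If you want to adapt the script to another collection, it can help to pull the file selection and the directory naming out of the loop so they can be changed and checked without contacting the server. The following is only a sketch under that assumption: the names is_wanted, speaker_subdir and WANTED_SUFFIXES are hypothetical helpers that are not part of PyAlveo or the attached script, and the file names in the demo are made up.

```python
import os

# Suffixes taken from the code above; change these for other collections.
WANTED_SUFFIXES = ('speaker16.wav', '.TextGrid')


def is_wanted(filename, suffixes=WANTED_SUFFIXES):
    """Return True if the file name ends with one of the wanted suffixes."""
    return filename.endswith(tuple(suffixes))


def speaker_subdir(outputdir, speakerid):
    """Build the per-speaker directory path (the directory is not created here)."""
    return os.path.join(outputdir, str(speakerid))


if __name__ == '__main__':
    # Made-up file names, for illustration only.
    names = ['sample-ch6-speaker16.wav', 'sample-ch1-other.wav', 'sample.TextGrid']
    for name in names:
        print(name, is_wanted(name))
    print(speaker_subdir('data', 'spk01'))
```

In the download loop you would then test is_wanted(filename) instead of the inline endswith check, and pass the result of speaker_subdir as the dir_path argument.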

This is just a simple example of a script that uses the PyAlveo library to access data via the API. You can read more about the library in the documentation.
