Memory Alpha • rtrek

The rtrek package includes some Star Trek datasets, but much more data is available outside the package. You can access other Star Trek data through various APIs.

Technically, there is only one formal API: the Star Trek API (STAPI). rtrek has functions to assist with making calls to this API in order to access specific data. See the STAPI vignette for details.

The focus of this vignette is on accessing data from Memory Alpha. rtrek interfaces with and extracts information from the Memory Alpha and Memory Beta websites. Neither of these sites actually exposes an API, but functions in rtrek assist with querying these websites in an API-like manner. For working with Memory Beta content, see the respective vignette.

The rtrek Memory Alpha API

Memory Alpha is a website that hosts information on all things relating to official canon Star Trek. This strictly pertains to the television series and movies. There are many other officially licensed Star Trek productions, e.g., the many hundreds of novels, but these are not technically canon even though they are often treated as such by many fans. For a broader focus on licensed works, see Memory Beta.

When talking about using rtrek to access data from Memory Alpha, the term data is used loosely. It would be just as accurate to say information, content or text. While the site contains a vast amount of information, it is not structured in the tidy tables a data scientist would hope to find. Memory Alpha is a wiki and can be thought of as similar to an encyclopedia. The bulk of its pages consist of articles. While some of these may contain interesting HTML tables, the site largely offers textual data.

Since Memory Alpha does not offer an API, the API-like interfacing provided by rtrek is just a collection of wrappers around web page scraping. In terms of what the relevant functions bring back from Memory Alpha, there are real limitations on the level of generality and quality of formatting that can be achieved across such a massive and diverse collection of articles.

Memory Alpha portals

There are six Memory Alpha web portals available. To see them, call the main function for Memory Alpha access, memory_alpha(), and pass it portals as the API endpoint.

memory_alpha("portals")
#> # A tibble: 6 × 2
#>   id         url                       
#>   <chr>      <chr>                     
#> 1 alternate  Portal:Alternate_Reality  
#> 2 people     Portal:People             
#> 3 science    Portal:Science            
#> 4 series     Portal:TV_and_films       
#> 5 society    Portal:Society_and_Culture
#> 6 technology Portal:Technology

The data frame returned provides each portal ID and respective “short URL”. These relative URLs are given in order to reduce verbosity and redundancy. All absolute URLs begin with https://memory-alpha.fandom.com/wiki/.
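As a minimal sketch of how a short URL maps to an absolute one, the snippet below prepends the base URL given above to the portal table. The table is copied from the output shown earlier, so no request is made; ma_absolute_url() is a hypothetical helper for illustration and is not part of rtrek.

```r
# Base URL as stated in the text above.
base_url <- "https://memory-alpha.fandom.com/wiki/"

# Portal table copied from the memory_alpha("portals") output shown earlier.
portals <- data.frame(
  id = c("alternate", "people", "science", "series", "society", "technology"),
  url = c("Portal:Alternate_Reality", "Portal:People", "Portal:Science",
          "Portal:TV_and_films", "Portal:Society_and_Culture",
          "Portal:Technology"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of rtrek): prepend the base URL to a short URL.
ma_absolute_url <- function(short_url) paste0(base_url, short_url)

portals$absolute_url <- ma_absolute_url(portals$url)
portals$absolute_url[2]
#> [1] "https://memory-alpha.fandom.com/wiki/Portal:People"
```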

In this special case where endpoint = "portals", this table is returned from the package itself because it is already known. The available portals are fixed. There is no accessing of Memory Alpha yet. The URLs shown are also not needed by the user, but are provided alongside the IDs for context.

Using a portal

When using a specific portal at the highest level (portal ID only), the returned data frame contains information about searchable categories available in the portal.

memory_alpha("people")
#> # A tibble: 104 × 3
#>    id          url                  group     
#>    <chr>       <chr>                <chr>     
#>  1 Acamarians  Category:Acamarians  By species
#>  2 Akritirians Category:Akritirians By species
#>  3 Aldeans     Category:Aldeans     By species
#>  4 Andorians   Category:Andorians   By species
#>  5 Androids    Category:Androids    By species
#>  6 Angosians   Category:Angosians   By species
#>  7 Aquans      Category:Aquans      By species
#>  8 Ardanans    Category:Ardanans    By species
#>  9 Augments    Category:Augments    By species
#> 10 Ba'ku       Category:Ba%27ku     By species
#> # ℹ 94 more rows

Again, there are id and url columns. There is also a group (and potentially a subgroup) column. This is only to provide meaningful context for the values in the id column if relevant for a given portal; group is not used for anything and the user can ignore it.

The above call does involve reaching out to Memory Alpha. While the portals are stable, it is expected that content within is regularly updated. Remember that this is not a real API. Since one is not available, what is really going on behind the scenes is the use of xml2 and rvest for web page harvesting.

Some portals have terminal endpoints at the top level (in Memory Alpha these are the written articles), but typically the top-level results for a portal are categories. You can always differentiate categories from articles by the URL, which begins with Category: for a category.
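The URL check described above can be sketched in base R. The short URLs below are of the kind that appear in this vignette. is_category() is a hypothetical helper, not an rtrek function. The second step decodes percent-encoded characters such as %27 (an apostrophe) for display, using utils::URLdecode().

```r
# Short URLs of the kind returned by memory_alpha() in this vignette.
urls <- c("Category:Acamarians", "Category:Ba%27ku", "Worf", "A%27trom")

# Hypothetical helper (not part of rtrek): TRUE for categories, FALSE for articles.
is_category <- function(url) startsWith(url, "Category:")

is_category(urls)
#> [1]  TRUE  TRUE FALSE FALSE

# Decode percent-encoded article ids for display, one string at a time.
article_ids <- vapply(urls[!is_category(urls)], utils::URLdecode,
                      character(1), USE.NAMES = FALSE)
article_ids
```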

Descending through subcategories is done by appending their id values, separated by a forward slash /.

memory_alpha("people/Klingons")
#> # A tibble: 250 × 2
#>    Klingons                       url                                    
#>    <chr>                          <chr>                                  
#>  1 Memory Alpha images (Klingons) Category:Memory_Alpha_images_(Klingons)
#>  2 Amar                           Amar                                   
#>  3 Antaak                         Antaak                                 
#>  4 A'trom                         A%27trom                               
#>  5 Atul                           Atul                                   
#>  6 Augments                       Category:Augments                      
#>  7 Azetbur                        Azetbur                                
#>  8 Ba'el                          Ba%27el                                
#>  9 Ba'ktor                        Ba%27ktor                              
#> 10 Barak-Kadan                    Barak-Kadan                            
#> # ℹ 240 more rows

memory_alpha("people/Klingons/Worf")
#> # A tibble: 1 × 4
#>   title content    metadata          categories       
#>   <chr> <list>     <list>            <list>           
#> 1 Worf  <xml_ndst> <tibble [1 × 16]> <tibble [15 × 2]>

Note the change in the structure of the final output, which is an article. This is the end of this particular road. The result is still a data frame, but it now has only one row, the article.

The columns include a text title and three nested datasets. content contains an xml_nodeset object left (mostly) unadulterated by memory_alpha(). This contains the article’s main content section, including ordered content from a default set of HTML tags. For more control over article content, see ma_article() in the next section. metadata contains a nested data frame of content parsed from the summary card that appears in the top right corner of articles. If this fails to parse for a given article, NULL is returned. categories returns a data frame containing the categories under which the article topic falls and their respective URLs.
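Because metadata can be NULL, code that unpacks the nested columns may want a guard. The sketch below builds a mock one-row result in base R rather than calling Memory Alpha. Its values are taken from an example later in this vignette, and a plain data frame stands in for the tibble. nested_or_empty() is a hypothetical helper, not part of rtrek.

```r
# Mock one-row result mimicking the shape described above. No request is made.
article <- data.frame(title = "The Icarus Factor (episode)",
                      stringsAsFactors = FALSE)
article$metadata <- list(NULL)  # summary card did not parse
article$categories <- list(
  data.frame(categories = "TNG episodes", url = "Category:TNG_episodes",
             stringsAsFactors = FALSE)
)

# Hypothetical helper (not part of rtrek): extract the nested element of a
# list column from a one-row result, or an empty data frame if it is NULL.
nested_or_empty <- function(x, column) {
  value <- x[[column]][[1]]
  if (is.null(value)) data.frame() else value
}

nrow(nested_or_empty(article, "metadata"))  # zero rows: nothing was parsed
nested_or_empty(article, "categories")$url  # the single category URL
```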

Articles

If you already know the article id, you can obtain an article directly using ma_article() instead of going through an endpoint with memory_alpha() that terminates in the same id. This also offers additional options to control which tags are included in the returned result and whether that result is the original xml_nodeset object or a character vector of only the extracted text. In either case, further work such as text analysis is left to the user.

worf <- ma_article("Worf", content_format = "character", content_nodes = c("h2", "h3"))
worf
#> # A tibble: 1 × 4
#>   title content    metadata          categories       
#>   <chr> <list>     <list>            <list>           
#> 1 Worf  <chr [34]> <tibble [1 × 16]> <tibble [15 × 2]>
worf$content[[1]] # Worf article section headings
#>  [1] "Early life"                          "The Rozhenkos"                      
#>  [3] "Coming of age"                       "Starfleet career"                   
#>  [5] "Service aboard the USS Enterprise-D" "Service on Deep Space 9"            
#>  [7] "Service aboard the USS Enterprise-E" "Other adventures"                   
#>  [9] "Later career"                        "Personality"                        
#> [11] "Physicality"                         "As a warrior"                       
#> [13] "Ailments and injuries"               "Family"                             
#> [15] "K'Ehleyr"                            "Alexander"                          
#> [17] "Jeremy Aster"                        "Jadzia Dax"                         
#> [19] "Kurn"                                "Nikolai Rozhenko"                   
#> [21] "Martok"                              "Friendships"                        
#> [23] "The crew of the Enterprise"          "Deep Space 9 companions"            
#> [25] "Alternate realities and timelines"   "Holograms"                          
#> [27] "Memorable quotes"                    "Chronology"                         
#> [29] "Appendices"                          "See also"                           
#> [31] "Appearances"                         "Background information"             
#> [33] "Apocrypha"                           "External links"

If browse = TRUE, the article page also launches in the browser.

Images

Full resolution source images can be downloaded and imported into R using ma_image() if you know the short URL. The easiest way to find URLs is by using a Memory Alpha portal. In the example below, the Memory Alpha images category under Klingons is selected. Look for a picture that includes Worf but also Data.

library(dplyr)
klingons <- memory_alpha("people/Klingons/Memory Alpha images (Klingons)")
klingons
#> # A tibble: 1,388 × 2
#>    `Memory Alpha images (Klingons)`             url                             
#>    <chr>                                        <chr>                           
#>  1 Memory Alpha images (Klingon holograms)      Category:Memory_Alpha_images_(K…
#>  2 Age of ascension pain sticks.jpg             File:Age_of_ascension_pain_stic…
#>  3 Ajilon Prime Klingon 1.jpg                   File:Ajilon_Prime_Klingon_1.jpg 
#>  4 Ajilon Prime Klingon 2.jpg                   File:Ajilon_Prime_Klingon_2.jpg 
#>  5 Ajilon Prime Klingons.jpg                    File:Ajilon_Prime_Klingons.jpg  
#>  6 Alexander Rozhenko as a deputy.jpg           File:Alexander_Rozhenko_as_a_de…
#>  7 Alexander Rozhenko at Kot'baval Festival.jpg File:Alexander_Rozhenko_at_Kot%…
#>  8 Alexander Rozhenko, 2367.jpg                 File:Alexander_Rozhenko,_2367.j…
#>  9 Alexander Rozhenko, 2370.jpg                 File:Alexander_Rozhenko,_2370.j…
#> 10 Alexander Rozhenko, 2374.jpg                 File:Alexander_Rozhenko,_2374.j…
#> # ℹ 1,378 more rows

worf_data <- filter(klingons, grepl("Worf", url) & grepl("Data", url))
worf_data
#> # A tibble: 10 × 2
#>    `Memory Alpha images (Klingons)`                                 url         
#>    <chr>                                                            <chr>       
#>  1 Data and Worf, 2369.jpg                                          File:Data_a…
#>  2 Data tries talking to Worf.jpg                                   File:Data_t…
#>  3 Data, Picard and Worf, 2375.jpg                                  File:Data,_…
#>  4 Data, Worf, and Beverly Crusher find Wesley Crusher.jpg          File:Data,_…
#>  5 La Forge, Data, Riker, Worf, and Picard, 2379.jpg                File:La_For…
#>  6 Picard Data and Worf on Iconia.jpg                               File:Picard…
#>  7 Picard, Data, and Worf away team.jpg                             File:Picard…
#>  8 Thomas Riker and William Riker play poker with Data and Worf.jpg File:Thomas…
#>  9 Wesley Crusher, Worf, and Data, quantum reality.jpg              File:Wesley…
#> 10 Worf carries Data through portal.jpg                             File:Worf_c…

Qapla’! This provides several results.

Technically, this is not the URL to an image file. It is a URL that redirects you to some other seemingly random article on the website that happens to include the image. This is not necessarily a unique instance of the image, nor is there any consistency in what portal or type of article it takes you to. memory_alpha(), and ma_article() using the short form URL, provide the article content associated with the “file” URL. See ma_image() below for viewing the actual image.

x <- memory_alpha("people/Klingons/Memory Alpha images (Klingons)/Data tries talking to Worf.jpg")
x
#> # A tibble: 1 × 4
#>   title                       content    metadata categories      
#>   <chr>                       <list>     <list>   <list>          
#> 1 The Icarus Factor (episode) <xml_ndst> <NULL>   <tibble [1 × 2]>

x <- ma_article("File:Data_tries_talking_to_Worf.jpg")
x
#> # A tibble: 1 × 4
#>   title                       content    metadata categories      
#>   <chr>                       <list>     <list>   <list>          
#> 1 The Icarus Factor (episode) <xml_ndst> <NULL>   <tibble [1 × 2]>

x$categories
#> [[1]]
#> # A tibble: 1 × 2
#>   categories   url                  
#>   <chr>        <chr>                
#> 1 TNG episodes Category:TNG_episodes

The likely intent is to obtain an image file after browsing the web pages that list image files. Even if you are interactively browsing the website, you have to click several times and scroll through additional articles before you can actually view the image file that was initially presented to you as a clickable link. This is a frustrating user experience and a confusing design. If you have a file name you want to view, just use ma_image(). It returns a ggplot object of the image file rather than an associated article.

ma_image("File:Data_tries_talking_to_Worf.jpg")

ma_image() takes two additional arguments: keep = TRUE retains the downloaded image file, and file specifies the output filename if you do not want it to be derived from the short URL. If you need more control over the plot, set keep = TRUE and then load the image file into R directly to plot it separately as needed.
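ma_image() expects a short URL, while the image listings also show a plain title for each file. The sketch below converts a title to its short URL. The encoding rules are an assumption inferred only from the title and url pairs printed in this vignette (spaces become underscores, apostrophes become %27), and image_short_url() is a hypothetical helper, not part of rtrek. Titles with other special characters may need handling that is not covered here.

```r
# Hypothetical helper (not part of rtrek). Rules inferred from the listings
# in this vignette: spaces -> underscores, apostrophes -> %27, "File:" prefix.
image_short_url <- function(title) {
  slug <- gsub(" ", "_", title, fixed = TRUE)
  slug <- gsub("'", "%27", slug, fixed = TRUE)
  paste0("File:", slug)
}

image_short_url("Data tries talking to Worf.jpg")
#> [1] "File:Data_tries_talking_to_Worf.jpg"
```

The result can then be passed to ma_image(), which does require access to Memory Alpha.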

Search

You can perform a Memory Alpha site search using ma_search(). This returns a data frame of search results content, including title, truncated text preview, and short URL for the first page of search results.

It does not recursively collate search results through subsequent pages of results. There could be an unexpectedly high number of pages of results depending on the search query. Since the general nature of this search feature seems relatively casual anyway, it aims only to provide a first page preview. As with ma_article(), setting browse = TRUE opens the page in the browser.

ma_search("Guinan")
#> # A tibble: 25 × 3
#>    title                                                       text        url  
#>    <chr>                                                       <chr>       <chr>
#>  1 Guinan                                                      "(covers i… http…
#>  2 Guinan: Intergalactic Woman of Mystery (and Hats) (podcast) "(written … http…
#>  3 Guinan Revisited (podcast)                                  "(written … http…
#>  4 Francis Guinan                                              "(written … http…
#>  5 Unnamed El-Aurians                                          "The follo… http…
#>  6 Unnamed individuals (unknown era)                           "List of u… http…
#>  7 Q                                                           "For addit… http…
#>  8 Borg                                                        "(covers i… http…
#>  9 USS Enterprise (NCC-1701-D)                                 "(covers i… http…
#> 10 Jean-Luc Picard                                             "(covers i… http…
#> # ℹ 15 more rows

Caveats

Memory Alpha contains almost 50,000 pages at the time of this rtrek version. It is possible that some articles may have idiosyncratic structure that could make them inaccessible by these rtrek functions.

This package version is also the first to offer this brand new functionality. Because Memory Alpha does not offer an API, rtrek relies on a less reliable web-scraping approach, and it is unknown at this time how likely it is that updates to Memory Alpha by its maintainers will introduce breaking changes.
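Given these caveats, it may be worth wrapping calls defensively in scripts that loop over many pages. The sketch below is one possible pattern using base R's tryCatch(). try_scrape() is a hypothetical wrapper, not part of rtrek, and flaky() is a stand-in for a scraping call so that the example runs offline.

```r
# Hypothetical wrapper (not part of rtrek): call f(...) and return `fallback`
# with a message, instead of stopping, when the call raises an error.
try_scrape <- function(f, ..., fallback = NULL) {
  tryCatch(f(...), error = function(e) {
    message("Request failed: ", conditionMessage(e))
    fallback
  })
}

# Stand-in for a scraping call that fails on an unusual page (an assumption
# made so this sketch does not need network access).
flaky <- function(id) {
  if (id == "bad page") stop("could not parse article")
  paste("parsed", id)
}

ok <- try_scrape(flaky, "Worf")       # returns the parsed value
bad <- try_scrape(flaky, "bad page")  # prints a message and returns NULL
```

In real use, the same wrapper could be applied to a package function, e.g., try_scrape(ma_article, "Worf").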

Jolan Tru.