The rtrek package includes some Star Trek datasets, but much more data
is available outside the package. You can access other Star Trek data
through various APIs.
Technically, there is only one formal API: the Star Trek API (STAPI).
rtrek has functions to assist with making calls to this API in order to
access specific data. See the STAPI vignette for details.
The focus of this vignette is on accessing data from Memory Beta.
rtrek interfaces with and extracts information from the Memory Alpha
and Memory Beta websites. Neither of these sites actually exposes an
API, but functions in rtrek assist with querying these websites in an
API-like manner. For working with Memory Alpha content, see the
respective vignette.
Memory Beta is a website that hosts information on all things relating to officially licensed Star Trek productions, for example novels, comics, etc. These are official, but official is not the same as canon. For a canon-only focus, see Memory Alpha.
When talking about using rtrek to access data from Memory Beta, the
term data is used loosely. It would be just as accurate to say
information, content or text. While the site contains a vast amount of
information, it is not structured in the tidy tables a data scientist
would love to encounter. Memory Beta is a wiki and can be thought of as
similar to an encyclopedia. The bulk of its pages consist of articles.
While some of these contain interesting HTML tables, the site largely
offers textual data.
Since Memory Beta does not offer an API, the API-like interfacing
provided by rtrek is just a collection of wrappers around web page
scraping. In terms of what the relevant functions bring back from
Memory Beta, there are real limitations on the level of generality and
quality of formatting that can be achieved across such a massive and
diverse collection of articles.
To see the available Memory Beta portals, call the main function for
Memory Beta access, memory_beta(), and pass it portals as the API
endpoint.
memory_beta("portals")
#> # A tibble: 13 × 2
#> id url
#> <chr> <chr>
#> 1 books Category:Books
#> 2 comics Category:Comics
#> 3 characters Category:Characters
#> 4 culture Category:Culture
#> 5 games Category:Games
#> 6 geography Category:Geography
#> 7 locations Category:Locations
#> 8 materials Category:Materials_and_substances
#> 9 politics Category:Politics
#> 10 science Category:Science
#> 11 starships Category:Starships
#> 12 technology Category:Technology
#> 13 timeline Category:Timeline
The data frame returned provides each portal ID and respective “short
URL”. These relative URLs are given in order to reduce verbosity and
redundancy. All absolute URLs begin with
https://memory-beta.fandom.com/wiki/.
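As a small illustration of how short URLs relate to absolute ones, the sketch below pastes the base URL onto two short URLs taken from the table above. It uses only base R; the variable names are illustrative and not part of rtrek.

```r
# Base URL shared by all Memory Beta pages (as stated above).
base_url <- "https://memory-beta.fandom.com/wiki/"

# Two short URLs copied from the portals table.
short_urls <- c("Category:Books", "Category:Materials_and_substances")

# Absolute URLs are simply the base URL followed by the short URL.
full_urls <- paste0(base_url, short_urls)
full_urls
```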
In this special case where endpoint = "portals", this table is returned
from the package itself because it is already known. The available
portals are fixed, so Memory Beta is not accessed yet. The URLs shown
are also not needed by the user, but are provided alongside the IDs for
context.
In contrast to the results of memory_alpha("portals"), notice that each
url entry begins with Category: rather than Portal. The Memory Beta
website does not explicitly offer “portals” like the Memory Alpha site.
For consistency from the perspective of rtrek, several top-level
categories at Memory Beta are treated as portals. The timeline portal
is a special case. It is a shorthand for the timeline subcategory at
culture/history/timeline.
Memory Beta is structured very similarly to Memory Alpha, but also
more consistently. The structural differences in content returned across
some portals accessed by memory_alpha() are not seen when obtaining
data with memory_beta(). This can make working with Memory Beta data a
smoother experience.
When using a specific portal at the highest level (portal ID only), the returned data frame contains information about searchable categories available in the portal.
memory_beta("characters")
#> # A tibble: 41 × 2
#> characters url
#> <chr> <chr>
#> 1 Biography Category:Biography
#> 2 Characters (alternate reality) Category:Characters_(alternate_reality)
#> 3 Memory Beta images (characters) Category:Memory_Beta_images_(characters)
#> 4 First Splinter timeline characters Category:First_Splinter_timeline_characte…
#> 5 Artificial beings Category:Artificial_beings
#> 6 Characters (alternates) Category:Characters_(alternates)
#> 7 Characters by affiliation Category:Characters_by_affiliation
#> 8 Characters by planet Category:Characters_by_planet
#> 9 Charteris Charteris
#> 10 Crossover characters Category:Crossover_characters
#> # ℹ 31 more rows
Again, there are ID and url columns, with the ID column named after the
requested portal or category. Unlike with high-level memory_alpha()
results, there are no group or subgroup columns. These are not
applicable given the simpler and more consistent structure of category
and article pages at Memory Beta.
The above call does involve reaching out to Memory Beta. While the
portals are stable, it is expected that content within is regularly
updated. Remember that this is not a real API. Since one is not
available, what is really going on behind the scenes is the use of
xml2 and rvest for web page harvesting.
Some portals have terminal endpoints - in Memory Beta these are the
written articles - at the top level, but typically the top level results
for a portal are categories. You can always differentiate categories
from articles by the URL, which will begin with Category: in the former
case.
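The URL rule can be applied programmatically. The sketch below is a minimal illustration in base R on a hand-typed data frame holding three rows copied from the output above. It does not contact Memory Beta, and the object names are illustrative.

```r
# Three rows copied from the memory_beta("characters") output above.
res <- data.frame(
  characters = c("Biography", "Charteris", "Crossover characters"),
  url = c("Category:Biography", "Charteris", "Category:Crossover_characters"),
  stringsAsFactors = FALSE
)

# Categories have URLs that begin with "Category:"; articles do not.
is_category <- startsWith(res$url, "Category:")

categories <- res$characters[is_category]
articles <- res$characters[!is_category]
articles
```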
Descending through subcategories is done by appending their id values,
separated by a forward slash /.
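When an endpoint is built up step by step, it may be convenient to keep the id values in a vector and collapse them with a forward slash. This is only a sketch in base R; the vector of ids matches the path used in this vignette.

```r
# Each element is the id of one level, from portal down to subcategory.
ids <- c("characters", "Characters by races and cultures",
         "Klingonoids", "Klingons")

# Join the levels with "/" to form the endpoint string.
endpoint <- paste(ids, collapse = "/")
endpoint
```

The resulting string can then be passed to memory_beta().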
Notice that compared to Memory Alpha, classification and resulting
hierarchical organization can be more detailed for similar content.
x <- "characters/Characters by races and cultures/Klingonoids/Klingons"
memory_beta(x)
#> # A tibble: 1,352 × 2
#> Klingons url
#> <chr> <chr>
#> 1 Klingon Klingon
#> 2 Klingon (mirror) Klingon_(mirror)
#> 3 Kring's battle cruiser personnel Kring%27s_battle_cruiser_personnel
#> 4 Unnamed Klingons Unnamed_Klingons
#> 5 M'Char boarding party M%27Char_boarding_party
#> 6 Aakan Aakan
#> 7 Aak'Torr Aak%27Torr
#> 8 Abbakh Abbakh
#> 9 Adon, son of Gorath Adon,_son_of_Gorath
#> 10 Adrokos Adrokos
#> # ℹ 1,342 more rows
memory_beta(paste0(x, "/Worf"))
#> # A tibble: 1 × 4
#> title content metadata categories
#> <chr> <list> <list> <list>
#> 1 Worf <xml_ndst> <tibble [1 × 17]> <tibble [47 × 2]>
Note the change in the structure of the final output, which is an article. This is the end of this particular road. The result is still a data frame, but now has only one row, the article.
The columns include a text title and three nested datasets. content
contains an xml_nodeset object left (mostly) unadulterated by
memory_beta(). This contains the article’s main content section,
including ordered content from a default set of html tags. For more
control over article content, see mb_article() in the next section.
metadata contains a nested data frame of content parsed from the
summary card that appears in the top right corner of articles. If this
fails to parse for a given article, NULL is returned. categories
returns a data frame containing the categories under which the article
topic falls and their respective URLs.
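Because the nested columns are list-columns, their contents are pulled out with [[1]] on a one-row result. The sketch below imitates that shape with a mock data frame so it runs offline. The category values are hypothetical placeholders, not scraped content, and metadata is set to NULL to mimic a summary card that failed to parse.

```r
# Mock of a one-row article result (hypothetical values, not real content).
article <- data.frame(title = "Worf", stringsAsFactors = FALSE)
article$categories <- list(data.frame(
  categories = c("Example A", "Example B"),
  url = c("Category:Example_A", "Category:Example_B"),
  stringsAsFactors = FALSE
))
article$metadata <- list(NULL)  # mimics a summary card that did not parse

# Extract the nested objects from the single row.
cats <- article$categories[[1]]
meta <- article$metadata[[1]]

# Guard against missing metadata before using it.
has_metadata <- !is.null(meta)
nrow(cats)
has_metadata
```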
If you already know the article id, you can obtain an article directly
using mb_article() instead of going through an endpoint with
memory_beta() that terminates in the same id. This also offers
additional options to control what tags are included in the returned
result and whether that result is the original xml_nodeset object or a
character vector of only the extracted text. In either case, further
work such as text analysis is left to the user.
worf <- mb_article("Worf", content_format = "character", content_nodes = c("h2", "h3"))
worf
#> # A tibble: 1 × 4
#> title content metadata categories
#> <chr> <list> <list> <list>
#> 1 Worf <chr [15]> <tibble [1 × 17]> <tibble [47 × 2]>
worf$content[[1]] # Worf article section headings
#> [1] "Biography" "Early life"
#> [3] "Starfleet Academy" "Starfleet officer"
#> [5] "Ambassador Worf" "Back with Starfleet"
#> [7] "Promotion" "Path to the 25th century"
#> [9] "Alternate realities" "Other alternate realities"
#> [11] "Interests" "Worf's service record"
#> [13] "Appendices" "Connections"
#> [15] "External links"
If browse = TRUE, the article page also launches in the browser.
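Once content is a character vector, ordinary string tools apply. As a minimal sketch, the section headings printed above are retyped here by hand so the example runs without contacting Memory Beta, and grep() selects the headings that mention Starfleet.

```r
# Section headings copied from the worf$content[[1]] output above.
headings <- c(
  "Biography", "Early life", "Starfleet Academy", "Starfleet officer",
  "Ambassador Worf", "Back with Starfleet", "Promotion",
  "Path to the 25th century", "Alternate realities",
  "Other alternate realities", "Interests", "Worf's service record",
  "Appendices", "Connections", "External links"
)

# Keep only the headings that contain "Starfleet".
starfleet <- grep("Starfleet", headings, value = TRUE)
starfleet
```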
Full resolution source images can be downloaded and imported into R
using mb_image() if you know the short URL. The easiest way to find
URLs is by using a Memory Beta portal. In the example below, the
Memory Beta images category under Klingons is selected.
The same example used in the Memory Alpha vignette cannot be duplicated here. Memory Beta does not contain as many images from the television series or movies as Memory Alpha. This is because Memory Beta focuses on licensed works in general and does not need to duplicate everything that is already documented at Memory Alpha.
This time, look for a picture that includes Worf but also K’Ehleyr.
library(dplyr)
x <- "characters/Memory Beta images (characters)/Memory Beta images (Worf)"
worf <- memory_beta(x)
worf
#> # A tibble: 137 × 2
#> `Memory Beta images (Worf)` url
#> <chr> <chr>
#> 1 Akvoh.jpg File:Akvoh.jpg
#> 2 Allanor.jpg File:Allanor.jpg
#> 3 Apocalypse Rising.jpg File:Apocalypse_Rising.jpg
#> 4 Areen and Worf.jpg File:Areen_and_Worf.jpg
#> 5 Bajoran riot Marvel.jpg File:Bajoran_riot_Marvel.jpg
#> 6 Barassociation.jpg File:Barassociation.jpg
#> 7 Bodai ShinC.jpg File:Bodai_ShinC.jpg
#> 8 Bridge1701D.jpg File:Bridge1701D.jpg
#> 9 Chancellor Martok.jpg File:Chancellor_Martok.jpg
#> 10 ChancellorWorf.jpg File:ChancellorWorf.jpg
#> # ℹ 127 more rows
worf_kehleyr <- filter(worf, grepl("Kehleyr", url))
worf_kehleyr
#> # A tibble: 1 × 2
#> `Memory Beta images (Worf)` url
#> <chr> <chr>
#> 1 WorfKehleyr.jpg File:WorfKehleyr.jpg
Qapla’! One result found.
Technically, this is not the url to an image file. It is a url that
redirects you to some other seemingly random article on the website that
happens to include the image in it. This is not necessarily a unique
instance of the image, nor is there any consistency in what portal or
type of article it takes you to. memory_beta(), and mb_article() using
the short form url, provide the article content associated with the
“file” url. See mb_image() below for viewing the actual image.
x <- memory_beta(paste0(x, "/WorfKehleyr.jpg"))
x
#> # A tibble: 1 × 4
#> title content metadata categories
#> <chr> <list> <list> <list>
#> 1 Worf <xml_ndst> <tibble [1 × 17]> <tibble [47 × 2]>
x <- mb_article("File:WorfKehleyr.jpg")
x
#> # A tibble: 1 × 4
#> title content metadata categories
#> <chr> <list> <list> <list>
#> 1 Worf <xml_ndst> <tibble [1 × 17]> <tibble [47 × 2]>
x$categories
#> [[1]]
#> # A tibble: 47 × 2
#> categories url
#> <chr> <chr>
#> 1 Memory Beta good articles Category:Memory_Beta_g…
#> 2 Memory Beta articles sourced from novels Category:Memory_Beta_a…
#> 3 Memory Beta articles sourced from comics Category:Memory_Beta_a…
#> 4 Memory Beta articles sourced from episodes and movies Category:Memory_Beta_a…
#> 5 Memory Beta articles sourced from video games Category:Memory_Beta_a…
#> 6 Memory Beta articles sourced from games Category:Memory_Beta_a…
#> 7 Memory Beta articles sourced from short stories Category:Memory_Beta_a…
#> 8 Memory Beta articles sourced from RPGs Category:Memory_Beta_a…
#> 9 Memory Beta articles sourced from novelizations Category:Memory_Beta_a…
#> 10 Memory Beta articles sourced from reference works Category:Memory_Beta_a…
#> # ℹ 37 more rows
The likely intent is to obtain an image file after browsing the web
pages that list image files. Even if you are interactively browsing the
website, you have to click several times and scroll through additional
articles before you can actually view the image file that was initially
presented to you as a clickable link. This is a frustrating user
experience and confusing design. If you have a file name you want to
view, just use mb_image() for this. It returns a ggplot object of the
image file rather than an associated article.
mb_image("File:WorfKehleyr.jpg")
mb_image() can take the additional arguments keep = TRUE, to retain the
downloaded image file, and file, to specify the output filename if you
do not want it to be derived from the short URL. If you need more
control over the plot, set keep = TRUE and then load the image file into
R directly to plot separately as needed.
In the examples above, you see the distinction between using
mb_image() to grab images and using memory_beta() to navigate and parse
image file page content more generally. Similarly, mb_timeline() is a
special function that specifically extracts and curates timeline data
from Memory Beta, in contrast to using the memory_beta("timeline")
portal with various endpoints to parse site pages for general content.
mb_timeline() focuses specifically on the chronology/timeline tables
found on various Memory Beta pages. This is a helpful function to have
because timeline data is not as uniformly arranged on Memory Beta as
mb_timeline() makes it appear.
According to Memory Beta, this timeline provides a “chronological guide to all stories and established events that have taken place in the Star Trek universe. The timeline includes stories from episodes, comics, novels, and games. Note: The timeline is largely based on the Pocket Books Timeline but also includes stories from comic and games and break downs of multi-timeframed episodes which the Pocket Timeline does not include.”
Obtaining timeline data is straightforward:
x <- mb_timeline(2360)
#> 2360
x$events
#> # A tibble: 21 × 2
#> period details
#> <chr> <chr>
#> 1 2360 Events
#> 2 2360 Quark opens a bar on Terok Nor.
#> 3 2360 Politics
#> 4 2360 Conflicts
#> 5 2360 Kalisi Reyar designs and completes a grid system that watches out for…
#> 6 2360 Federation politics
#> 7 2360 Pahkwa-thanh is admitted into the Federation. [citation needed]
#> 8 2360 Cardassian politics
#> 9 2360 Miras Vara has taken the leadership of the Oralian Way as Astraea; in…
#> 10 2360 Natima Lang and Gaten Russol join the cardassian dissident movement
#> # ℹ 11 more rows
x$stories
#> # A tibble: 7 × 11
#> title title_url colleciton collection_url section context series date media
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 The Bu… The_Buri… NA NA Chapte… NA The N… 2360 novel
#> 2 Fearfu… Fearful_… NA NA Side 2… NA Deep … 2360 novel
#> 3 The Bu… The_Buri… NA NA Chapte… NA The N… 2360… novel
#> 4 Cataly… Catalyst… NA NA NA NA The L… 2360 novel
#> 5 Lefler… Lefler%2… No Limits No_Limits Twelft… NA New F… 2360… shor…
#> 6 A Stit… A_Stitch… NA NA Part 2… NA Deep … 2360 novel
#> 7 Dawn o… Dawn_of_… Terok Nor Star_Trek:_Te… Prolog… NA Deep … 2360 novel
#> # ℹ 2 more variables: notes <chr>, image_url <chr>
mb_timeline() returns a list of two data frames. The first contains
notable historical events and has the following columns:
- period is the year or other time period.
- details is a character column that contains text lines describing events (or section headings above events in following lines). Set html = TRUE to obtain this text column in a format that includes HTML tags for a more verbose but clear section hierarchy.
The stories data frame represents a timeline of published stories. Since this is timeline data, stories are in chronological order, not publication date order. The stories data frame includes the columns shown in the printed output above, such as title, series, date and media.
Either events or stories may be NULL if no entries of that kind exist
for a given year.
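Since either element can be NULL, downstream code may want to check before using it. The sketch below uses a mock list shaped like the described return value, with two events rows copied from the output above and stories set to NULL. The count_rows() helper is hypothetical and not part of rtrek.

```r
# Mock result with the same shape as the list described above.
x <- list(
  events = data.frame(
    period = c("2360", "2360"),
    details = c("Events", "Quark opens a bar on Terok Nor."),
    stringsAsFactors = FALSE
  ),
  stories = NULL  # no stories for this hypothetical period
)

# Hypothetical helper: treat a NULL data frame as having zero rows.
count_rows <- function(df) if (is.null(df)) 0L else nrow(df)

n_events <- count_rows(x$events)
n_stories <- count_rows(x$stories)
c(events = n_events, stories = n_stories)
```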
There are several ways to use mb_timeline(), based around how the
timeline data is organized on Memory Beta. You saw one way above,
requesting data for the year 2360. Data can be requested for:
- specific years, e.g., mb_timeline(2360) or mb_timeline(2360:2364). Note these are integers. All other options are character.
- a decade, by appending s to the decade, e.g., mb_timeline("2360s").
- mb_timeline("past") returns timeline data only for the “distant past” section of the timeline.
- mb_timeline("future") returns timeline data only for the “distant future” section of the timeline.
- mb_timeline("main") returns timeline data only for the main section of the timeline. This is everything except for the distant past and future sections.
- mb_timeline("complete") returns the complete timeline, including past, main and future sections.
As with stapi(), mb_timeline() enforces a minimum one-second wait
between any page requests. The pages could be read much faster, but this
measure forces users to be good neighbors. It is also recommended to try
the function on a single year to see if the results are what you expect
and need before wasting time pulling the complete timeline. Note that
using complete (or main, which is almost the same since past and future
are relatively small) can take over ten minutes due to the enforced wait
time.
This function is also memoized, meaning its results are cached in memory for each specific call. This prevents wastefully making an identical call twice in the same R session. If you do, the cached result is returned instantly. You will not see the progress printed to the console because the function is not actually called again.
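To make the idea of memoization concrete, here is a minimal base R sketch of a cache wrapped around a function. It illustrates the general technique only and is not how rtrek implements it. The names memoize(), slow_lookup() and fast_lookup() are hypothetical.

```r
# Wrap a one-argument function so repeated calls reuse a cached result.
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # first call: compute and store
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

calls <- 0L  # counts how often the underlying function really runs
slow_lookup <- function(x) {
  calls <<- calls + 1L
  paste("timeline for", x)
}
fast_lookup <- memoize(slow_lookup)

first <- fast_lookup("2360s")
second <- fast_lookup("2360s")  # served from the cache
calls
```

The counter stays at one after the second call, which mirrors why no progress is printed when an identical call is repeated.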
Integer years are the only case where the argument to mb_timeline() can
be a vector. All other options are scalar, including decade. You cannot
request multiple decades with c("2360s", "2370s"). If this is what you
want, use mb_timeline(2360:2379).
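If decades are more natural to write, they can be expanded to integer years first. The helper below is a hypothetical convenience in base R, not an rtrek function. It assumes each input is a four-digit decade followed by s, as in the examples above.

```r
# Hypothetical helper: expand decade strings such as "2360s" to integer years.
decade_years <- function(decades) {
  start <- as.integer(sub("s$", "", decades))  # "2360s" -> 2360L
  unlist(lapply(start, function(s) s:(s + 9L)))
}

years <- decade_years(c("2360s", "2370s"))
range(years)
```

The resulting integer vector is in the form that mb_timeline() accepts for multiple years.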
The year (or other period for past and future) is printed to the
console as mb_timeline() progresses through timeline data for each time
step. Set details = FALSE to suppress this.
You can perform a Memory Beta site search using mb_search(). This
returns a data frame of search results content, including title,
truncated text preview, and short URL for the first page of search
results.
It does not recursively collate search results through subsequent
pages of results. There could be an unexpectedly high number of pages of
results depending on the search query. Since the general nature of this
search feature seems relatively casual anyway, it aims only to provide a
first page preview. As with mb_article(), setting browse = TRUE opens
the page in the browser.
mb_search("Guinan")
#> # A tibble: 25 × 3
#> title text url
#> <chr> <chr> <chr>
#> 1 Guinan Guinan is a member of the El-Aurian speci… http…
#> 2 Guinan (mirror) universe, Guinan was a member of the Terr… http…
#> 3 Leonard McCoy Leonard Horatio McCoy (also known as Leon… http…
#> 4 James T. Kirk James Tiberius Kirk was a Human man who w… http…
#> 5 Indistinguishable from Magic Indistinguishable from Magic. In 2161, sh… http…
#> 6 Montgomery Scott Spoiler warning: Plot and/or ending detai… http…
#> 7 USS Enterprise (NCC-1701-D) Spoiler warning: Plot and/or ending detai… http…
#> 8 Geordi La Forge Geordi La Forge was a male Human who was … http…
#> 9 2374 2374 was, on Earth's calendar, the 75th y… http…
#> 10 Reginald Barclay Spoiler warning: Plot and/or ending detai… http…
#> # ℹ 15 more rows
Memory Beta contains over 50,000 pages at the time of this rtrek
version. It is possible that some articles may have idiosyncratic
structure that could make them inaccessible by these rtrek functions.
This package version is also the first to offer this functionality and, as mentioned, Memory Beta does not offer an API, which leads to a less reliable web-scraping approach. It is therefore unknown at this time how likely it is that updates to Memory Beta by its maintainers will cause breaking changes.
Jolan Tru.