DESCRIPTION Blabla...
Usage
add_toc(
  md,
  min_tier = 2L,
  max_tier = 6L,
  position = "above",
  md_flavor = c("github", "gitlab"),
  add_title = TRUE,
  title = "Table of contents",
  title_tier = min_tier,
  add_backlinks = add_title,
  backlink_strings = c("↑", "↓"),
  backlink_position = c("before", "after"),
  listing_style = c("-", "*", "ordered", "indented"),
  toc_id = "toc",
  old_toc_id = toc_id
)
Arguments
- md
(R) Markdown document to be processed as a single file path, a single URL or a character vector (one string per line).
- min_tier
Minimum tier of headers (<h1>–<h6>) to include in the TOC. Integer between 1 and 6. For example, min_tier = 2 means to create TOC entries for <h2> headers and all tiers below.
- max_tier
Maximum tier of headers (<h1>–<h6>) to include in the TOC. Integer between 1 and 6. For example, max_tier = 5 means to create TOC entries for all headers down to <h5>. max_tier must be >= min_tier.
- position
Position in the Markdown document at which to add the TOC. Possible values include:
  - "top": The very beginning of the document, i.e. the first line.
  - "bottom": The very end of the document, i.e. the last line.
  - "above": Above the lines between the uppermost header of tier <= min_tier and the next header above (if any).
  - "below": Below the lines between the uppermost header of tier <= min_tier and the next header above (if any), i.e. right above the uppermost header of tier <= min_tier.
  - "none": Only remove a possibly existing TOC.
  - A line number, given as a positive integer.
- md_flavor
Markdown flavor. Possible values include:
  - "github"
  - "gitlab"
- add_title
Include a TOC title? Logical. Note that no backlinks are added at all if add_title = FALSE and no header line is found above position, regardless of add_backlinks = TRUE.
- title
Title of the TOC. A character scalar.
- title_tier
Tier/formatting of TOC title. Possible values include:
  - An integer between 1L and 6L representing the <h1>–<h6> tier.
  - "regular": Simple unformatted non-header text.
  - "bold": Bold (<strong>) non-header text.
  - "italic": Italic (<em>) non-header text.
- add_backlinks
Add a link back to the TOC to each Markdown header. A logical scalar. Note that if add_backlinks = TRUE and add_title = FALSE, as a fallback the backlinks point to the next header line above position (if any found). This will also be the case if md_flavor = "gitlab" and title_tier is set to a non-header value ("regular", "bold" or "italic") because GitLab currently ignores manually set HTML <id> attributes.
- backlink_strings
String(s) to use as link text back to the TOC. A character vector of length 1 or 2. If two strings are provided, the first one will be used for backlinks below position, the second one for backlinks above position.
- backlink_position
Position of the backlinks. Possible values include:
  - "before": Before the actual header text.
  - "after": After the actual header text.
- listing_style
Format to use for listing the TOC entries. Possible values include:
  - "-": Create an unordered list using a hyphen as listing symbol.
  - "*": Create an unordered list using an asterisk as listing symbol.
  - "ordered": Create an ordered list using 1, 2, 3, ... as listing symbols.
  - "indented": Use non-breaking spaces (&nbsp;) to create visual indentation. Useful if the headers are already numbered.
- toc_id
HTML <id> attribute of the TOC title if title_tier is set to a non-header value ("regular", "bold" or "italic"). A character scalar.
- old_toc_id
HTML <id> attribute of the old TOC title (in order to have old backlinks with an ID other than toc_id removed). A character scalar.
Details
This function tries to adhere to the CommonMark specification, i.e. to interpret the Markdown syntax the same way as the commonmark.js reference implementation at <try.commonmark.org> does.
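To give a rough idea of what CommonMark-aware header detection involves, here is a minimal base-R sketch. It is not how add_toc is implemented. The helper find_atx_headers and the demo input are hypothetical names introduced only for illustration. The sketch covers two CommonMark rules: an ATX header consists of up to three spaces of indentation, one to six # characters, and then whitespace or the end of the line; and lines inside fenced code blocks are never headers. Setext headers, closing # sequences and the exact fence-length rules are deliberately ignored.

```r
# Hypothetical helper for illustration only; NOT part of tocr.
# Returns the ATX headers of `lines` whose tier lies in [min_tier, max_tier].
find_atx_headers <- function(lines, min_tier = 2L, max_tier = 6L) {
  stopifnot(min_tier >= 1L, max_tier <= 6L, max_tier >= min_tier)
  in_fence <- FALSE
  fence_char <- ""
  idx <- integer()
  tiers <- integer()
  texts <- character()
  for (i in seq_along(lines)) {
    line <- lines[i]
    # A fence starts with at least three backticks or tildes (simplified rule)
    fence <- regmatches(line, regexpr("^ {0,3}(`{3,}|~{3,})", line))
    if (length(fence) == 1L) {
      char <- substr(trimws(fence), 1L, 1L)
      if (!in_fence) {
        in_fence <- TRUE
        fence_char <- char
      } else if (char == fence_char) {
        in_fence <- FALSE
      }
      next
    }
    # Anything inside a fenced code block is code, never a header
    if (in_fence) next
    # ATX header: 0-3 spaces, 1-6 '#', then whitespace or end of line
    if (grepl("^ {0,3}#{1,6}([ \t]|$)", line)) {
      tier <- nchar(sub("^ {0,3}(#{1,6}).*$", "\\1", line))
      if (tier >= min_tier && tier <= max_tier) {
        idx <- c(idx, i)
        tiers <- c(tiers, tier)
        texts <- c(texts, trimws(sub("^ {0,3}#{1,6}", "", line)))
      }
    }
  }
  data.frame(line = idx, tier = tiers, text = texts, stringsAsFactors = FALSE)
}

# Made-up input: the shell comment and "#NoSpace" must not count as headers
demo <- c(
  "# Title",
  "",
  "## Setup",
  "~~~sh",
  "# a shell comment, not a header",
  "~~~",
  "### Details",
  "#NoSpace is plain text"
)
find_atx_headers(demo)
```

With the default min_tier = 2L, the tier-1 title is skipped as well, mirroring how min_tier and max_tier bound the TOC entries.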
Examples
md <- paste0("https://raw.githubusercontent.com/ropensci/pdftools/",
             "e7248d9956c7e73968628fa3a8ed37f0a8c23b37/README.md")
md |>
  tocr::add_toc(position = 9) |>
  pal::cat_lines()
#> # pdftools
#>
#> [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
#> [![Build Status](https://travis-ci.org/ropensci/pdftools.svg?branch=master)](https://travis-ci.org/ropensci/pdftools)
#> [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pdftools?branch=master&svg=true)](https://ci.appveyor.com/project/jeroen/pdftools)
#> [![Coverage Status](https://codecov.io/github/ropensci/pdftools/coverage.svg?branch=master)](https://codecov.io/github/ropensci/pdftools?branch=master)
#> [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdftools)](http://cran.r-project.org/package=pdftools)
#> [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/pdftools)](http://cran.r-project.org/web/packages/pdftools/index.html)
#>
#> <!-- TOC BEGIN -- leave this comment untouched to allow auto update -->
#>
#> ## Table of contents
#>
#> - [Introduction](#-introduction)
#> - [Installation](#-installation)
#> - [Building from source](#-building-from-source)
#> - [Getting started](#-getting-started)
#> - [Bonus feature: rendering pdf](#-bonus-feature-rendering-pdf)
#> - [Limitations](#-limitations)
#>
#> <!-- TOC END -- leave this comment untouched to allow auto update -->
#>
#> ## [↑](#table-of-contents) Introduction
#>
#> Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plain-text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or pay-walled search engines.
#>
#> The pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler c++ interface together with Rcpp, which results in a lighter and more portable implementation.
#>
#>
#> ## [↑](#table-of-contents) Installation
#>
#> On Windows and Mac the binary packages can be installed directly from CRAN:
#>
#> ```r
#> install.packages("pdftools")
#> ```
#>
#> Installation on Linux requires the poppler development library. On Debian/Ubuntu:
#>
#> ```
#> sudo apt-get install libpoppler-cpp-dev
#> ```
#>
#> If you want to install the package from source on Mac OS-X you need brew:
#>
#> ```
#> brew install poppler
#> ```
#>
#> On Fedora:
#>
#> ```
#> sudo yum install poppler-cpp-devel
#> ```
#>
#> ### [↑](#table-of-contents) Building from source
#>
#> On CentOS the `libpoppler-cpp` library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.
#>
#> ```sh
#> # Build dependencies
#> yum install wget xz libjpeg-devel openjpeg2-devel
#>
#> # Download and extract
#> wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
#> tar -Jxvf poppler-0.47.0.tar.xz
#> cd poppler-0.47.0
#>
#> # Build and install
#> ./configure
#> make
#> sudo make install
#> ```
#>
#> By default libraries get installed in `/usr/local/lib` and `/usr/local/include`. On CentOS this is not a default search path so we need to set `PKG_CONFIG_PATH` and `LD_LIBRARY_PATH` to point R to the right directory:
#>
#> ```sh
#> export LD_LIBRARY_PATH="/usr/local/lib"
#> export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
#> ```
#>
#> We can then start R and install `pdftools`.
#>
#> ## [↑](#table-of-contents) Getting started
#>
#> The `?pdftools` manual page shows a brief overview of the main utilities. The most important function is `pdf_text` which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.
#>
#> ```r
#> library(pdftools)
#> download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
#> txt <- pdf_text("1403.2805.pdf")
#>
#> # first page text
#> cat(txt[1])
#>
#> # second page text
#> cat(txt[2])
#> ```
#>
#> In addition, the package has some utilities to extract other data from the PDF file. The `pdf_toc` function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:
#>
#> ```r
#> # Table of contents
#> toc <- pdf_toc("1403.2805.pdf")
#>
#> # Show as JSON
#> jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
#> ```
#>
#> Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
#>
#>
#> ```r
#> # Author, version, etc
#> info <- pdf_info("1403.2805.pdf")
#>
#> # Table with fonts
#> fonts <- pdf_fonts("1403.2805.pdf")
#> ```
#>
#> ## [↑](#table-of-contents) Bonus feature: rendering pdf
#>
#> A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use `pdf_render_page` to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
#>
#> ```r
#> # renders pdf to bitmap array
#> bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
#>
#> # save bitmap image
#> png::writePNG(bitmap, "page.png")
#> jpeg::writeJPEG(bitmap, "page.jpeg")
#> webp::write_webp(bitmap, "page.webp")
#> ```
#>
#> This feature is still experimental and currently does not work on Windows.
#>
#> ## [↑](#table-of-contents) Limitations
#>
#> Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.
#>
#> ```r
#> txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
#>
#> # some tables
#> cat(txt[18])
#> cat(txt[19])
#> ```
#>
#> Pdftools usually does a decent job in retaining the positioning of table elements when converting from pdf to text. But the output is still very dependent on the formatting of the original pdf table, which makes it very difficult to write a generic table extractor. But with a little creativity you might be able to parse the table data from the text output of a given paper.
#>
#>
#> [![](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)
md |>
  tocr::add_toc(listing_style = "ordered") |>
  pal::cat_lines()
#> # pdftools
#>
#> <!-- TOC BEGIN -- leave this comment untouched to allow auto update -->
#>
#> ## Table of contents
#>
#> 1. [Introduction](#-introduction)
#> 2. [Installation](#-installation)
#> 1. [Building from source](#-building-from-source)
#> 1. [Getting started](#-getting-started)
#> 2. [Bonus feature: rendering pdf](#-bonus-feature-rendering-pdf)
#> 3. [Limitations](#-limitations)
#>
#> <!-- TOC END -- leave this comment untouched to allow auto update -->
#>
#> [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
#> [![Build Status](https://travis-ci.org/ropensci/pdftools.svg?branch=master)](https://travis-ci.org/ropensci/pdftools)
#> [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pdftools?branch=master&svg=true)](https://ci.appveyor.com/project/jeroen/pdftools)
#> [![Coverage Status](https://codecov.io/github/ropensci/pdftools/coverage.svg?branch=master)](https://codecov.io/github/ropensci/pdftools?branch=master)
#> [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/pdftools)](http://cran.r-project.org/package=pdftools)
#> [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/pdftools)](http://cran.r-project.org/web/packages/pdftools/index.html)
#>
#> ## [↑](#table-of-contents) Introduction
#>
#> Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plain-text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or pay-walled search engines.
#>
#> The pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler c++ interface together with Rcpp, which results in a lighter and more portable implementation.
#>
#>
#> ## [↑](#table-of-contents) Installation
#>
#> On Windows and Mac the binary packages can be installed directly from CRAN:
#>
#> ```r
#> install.packages("pdftools")
#> ```
#>
#> Installation on Linux requires the poppler development library. On Debian/Ubuntu:
#>
#> ```
#> sudo apt-get install libpoppler-cpp-dev
#> ```
#>
#> If you want to install the package from source on Mac OS-X you need brew:
#>
#> ```
#> brew install poppler
#> ```
#>
#> On Fedora:
#>
#> ```
#> sudo yum install poppler-cpp-devel
#> ```
#>
#> ### [↑](#table-of-contents) Building from source
#>
#> On CentOS the `libpoppler-cpp` library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.
#>
#> ```sh
#> # Build dependencies
#> yum install wget xz libjpeg-devel openjpeg2-devel
#>
#> # Download and extract
#> wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
#> tar -Jxvf poppler-0.47.0.tar.xz
#> cd poppler-0.47.0
#>
#> # Build and install
#> ./configure
#> make
#> sudo make install
#> ```
#>
#> By default libraries get installed in `/usr/local/lib` and `/usr/local/include`. On CentOS this is not a default search path so we need to set `PKG_CONFIG_PATH` and `LD_LIBRARY_PATH` to point R to the right directory:
#>
#> ```sh
#> export LD_LIBRARY_PATH="/usr/local/lib"
#> export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
#> ```
#>
#> We can then start R and install `pdftools`.
#>
#> ## [↑](#table-of-contents) Getting started
#>
#> The `?pdftools` manual page shows a brief overview of the main utilities. The most important function is `pdf_text` which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.
#>
#> ```r
#> library(pdftools)
#> download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
#> txt <- pdf_text("1403.2805.pdf")
#>
#> # first page text
#> cat(txt[1])
#>
#> # second page text
#> cat(txt[2])
#> ```
#>
#> In addition, the package has some utilities to extract other data from the PDF file. The `pdf_toc` function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:
#>
#> ```r
#> # Table of contents
#> toc <- pdf_toc("1403.2805.pdf")
#>
#> # Show as JSON
#> jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
#> ```
#>
#> Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
#>
#>
#> ```r
#> # Author, version, etc
#> info <- pdf_info("1403.2805.pdf")
#>
#> # Table with fonts
#> fonts <- pdf_fonts("1403.2805.pdf")
#> ```
#>
#> ## [↑](#table-of-contents) Bonus feature: rendering pdf
#>
#> A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use `pdf_render_page` to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
#>
#> ```r
#> # renders pdf to bitmap array
#> bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
#>
#> # save bitmap image
#> png::writePNG(bitmap, "page.png")
#> jpeg::writeJPEG(bitmap, "page.jpeg")
#> webp::write_webp(bitmap, "page.webp")
#> ```
#>
#> This feature is still experimental and currently does not work on Windows.
#>
#> ## [↑](#table-of-contents) Limitations
#>
#> Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.
#>
#> ```r
#> txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
#>
#> # some tables
#> cat(txt[18])
#> cat(txt[19])
#> ```
#>
#> Pdftools usually does a decent job in retaining the positioning of table elements when converting from pdf to text. But the output is still very dependent on the formatting of the original pdf table, which makes it very difficult to write a generic table extractor. But with a little creativity you might be able to parse the table data from the text output of a given paper.
#>
#>
#> [![](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)
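A further sketch, this time passing the document as a character vector (one string per line) instead of a URL and combining several of the arguments documented above. The input md_lines is made up for illustration. The call is guarded so the snippet also runs where tocr is not installed, and it assumes, as the examples above suggest, that add_toc returns the processed document as a character vector. No output is shown because it has not been verified here.

```r
# Made-up Markdown document, one string per line
md_lines <- c(
  "# My project",
  "",
  "Some introductory text.",
  "",
  "## Installation",
  "",
  "### From source",
  "",
  "## Usage"
)

# Only run if tocr is available
if (requireNamespace("tocr", quietly = TRUE)) {
  with_toc <- tocr::add_toc(
    md = md_lines,
    max_tier = 3L,                # include <h2> and <h3> headers only
    position = "top",             # put the TOC on the very first line
    title = "Contents",
    listing_style = "*",          # unordered list with asterisks
    backlink_position = "after"   # backlinks follow the header text
  )
  # Assumption: the result is a character vector with one string per line
  cat(with_toc, sep = "\n")
}
```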