Read patterns

While the approach 1 table, 1 file works for small tables, say less than a million lines, it rapidly becomes impractical for larger tables, as the daily data refresh ends up requiring you to move impractically large amounts of data around. In those situations, it becomes necessary to split a table into multiple files, typically following some kind of incremental split of the data. In order to support this practice, Envision offers the file patterns mechanism to consolidate many files into a single file.

Table of contents

Wildcard patterns

Let’s start by generating 3 files that we intend to consolidate later on:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

schema Products with
  Product : text
  Price : number

where Products.Product == "shirt"
  write Products as "/sample/products-1.csv" with
    schema Products

where Products.Product == "pants"
  write Products as "/sample/products-2.csv" with
    schema Products

where Products.Product == "hat"
  write Products as "/sample/products-3.csv" with
    schema Products

Each file has the same format (i.e. same columns), and the file names are essentially similar, only diverging by a suffix inserted just prior to the file extension. These files can be read in a single statement as illustrated by:

read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

ProductPrice
shirt10.5
pants15
hat5.3

In the above script, the wildcard *(star) character is used to capture any of the three filename variants, which ends up consolidated in a single Products table.

The wildcard * matches any character as long as it’s not a /, which was used to separate folders. The file pattern can also include multiple wildcards. This behavior is heavily inspired from the behavior of the ls command line in Linux and the dir command line in Windows.

A single script can contain multiple read statements. The ordering of those read statements has no impact on the set of files captured by each statement. In fact, the ordering of the read statements has almost no impact whatsoever. There is a single exception to this principle of unordered read statements with column inferences (see next section).

If a file is captured multiple times through multiple read statements, the content of the file is loaded multiple times.

Pattern unions

Heterogeneous sets of files that cannot be captured by a single file pattern, or that do not follow the same format, can still be consolidated in a single table through multiple read statements. The consolidation is achieved through a mechanism referred to as pattern unions.

Let’s leverage the file created in the previous section to illustrate how the Products table can be read from two distinct read statements:

read "/sample/products-1.csv" as Products with
  Product : text
  Price : number

read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

ProductPrice
shirt10.5
shirt10.5
pants15
hat5.3

In the above script, the table Products appears twice and, as a result, Envision ends up consolidating all the data originating from the two sets of files into a single table. The file products-1.csv is read twice.

However, as all the files follow the exact same format, the declaration of the column in the second read statement is redundant, and merely mirrors the content of the first read statement. Envision offers a mechanism to avoid this redundancy altogether:

read "/sample/products-1.csv" as Products with
  Product : text
  Price : number

read "/sample/products-*.csv" as Products

show table "Products" a1b3 with
  Products.Product
  Products.Price

This script above is strictly equivalent to the previous one. The second read statement omits entirely the columns that are inferred by Envision from the previous statement.

The inference of the colums is the exception to the general principles that read statements are unordered, as the statement that specifies the column must come first.

The two main use cases for resorting to pattern unions are either that files are dispersed among folders that can’t be captured at once with a single file pattern, or that there are various file formats involved. The example script provided in this section does not fall into either one of those two use cases, the sole point of this example is to illustrate the syntax. The varying format use case will be discussed in the “File Formats” section below.

Advanced remark: The first use case, the more general filename capture, could also have been addressed by the use of a more general pattern matching syntax such as regular expressions (regex). However, a general matching syntax presents a performance risk whenever millions of filenames have been matched against arbitrary complex rules. A guiding design principle for Envision is predictible performance, and the performance of Envision has to remain fully decoupled from the overall number of files present in the account - as long as those files aren’t actually contributing to the script calculations.

Pattern filters

When a wildcard is used, the read pattern can capture an arbitrarily large number of files. The pattern filters allow you to restrict the capture to a subset of the original set. Usually, this behavior proves useful when the intent is to restrict the read statement to the “last” N files.

Leveraging again the three files created in the previous section, the following script only read two files out of the three:

read max 2 "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

ProductPrice
pants15
hat5.3

The max 2 option that follows the keyword read indicates that within all the files captured by the pattern, only the last two; based on lexicographic ordering, are kept.

More generally, the file pattern supports three filters max, min and latest. Both min and max sort files by name, while latest sorts files against their last update datetime (most recent updates first). The integer that follows the filter can be omitted, the default value being 1.

Combined with the single match behavior, the filters offer the possibility to select a range of files. By capturing files in a first pattern (and typically ignoring them afterward), the initial capture acts as a negative condition on the second pattern.

Roadmap: The pattern filters are somewhat unsatisfying because they don’t really connect to the rest of the Envision syntax. Both patterns and filters aren’t as expressive as we would like them to be, and as a result, the range selection is not a completely straightforward matter. Rather than adding more options, we will probably revisit this bit of syntax entirely in the future.

Pattern expressions

The read statement takes a file pattern - a text argument - as input. So far, all the file patterns that we have used were text literals. However, there are situations where a text expression is preferable. For example, the folder might be the same for many read statements, and it might be desirable to change the target folder without revisiting every single read statement in the script. The pattern expressions are intended precisely for this purpose.

Let’s re-introduce a short script to generate a file:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

write Products as "/sample/products.csv" with
  Product = Products.Product
  Price = Products.Price

This file can then be read with the script below:

myFolder = "/sample"

read "\{myFolder}/products.csv" as Products with
  Product : text
  Price : number

show table "Products" with
  Products.Product
  Products.Price

In the above script, the folder is isolated in a scalar text variable named myFolder. This variable is then injected in the file pattern itself through the usual text interpolation mechanism that we have seen in an earlier section.

The variables that contribute to the file patterns have to be compile-time constants. As a rule of thumb, in Envision any basic operation, say a text concatenation, that involves only constant arguments produces a constant result. This mechanism is known as constant propagation.

It is also possible to directly pass a text expression (instead of a text literal) as the file pattern, as done in the following script:

myFolder = "/sample"
myPath = "\{myFolder}/products.csv"

read (myPath) as Products with
  Product : text
  Price : number

show table "Products" with
  Products.Product
  Products.Price

Due to parsing and considering the design of Envision itself, unless it’s a text literal, the text expression of the file pattern has to be surrounded by parentheses ( and ).

Roadmap: The file count integer associated with the max, min and latest option has to be a number literal. It does not benefit from any constant propagation mechanism. However, it’s unlikely that this feature will be introduced later on as such. Such a capability is more likely to emerge from the larger decoupling overhaul of the read statements (decoupling patterns from formats).