Read patterns

While the 1 table, 1 file approach works for small tables, say fewer than a million lines, it rapidly becomes impractical for larger tables, as the daily data refresh ends up requiring you to move impractically large amounts of data around. In those situations, it becomes necessary to split a table into multiple files, typically following some kind of incremental split of the data. In order to support this practice, Envision offers the file patterns mechanism to consolidate many files into a single table.


Wildcard patterns

Let’s start by generating 3 files that we intend to re-consolidate later on:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

where Products.Product == "shirt"
  show table "Products 1" export: "/sample/products-1.csv" with
    Products.Product
    Products.Price

where Products.Product == "pants"
  show table "Products 2" export: "/sample/products-2.csv" with
    Products.Product
    Products.Price

where Products.Product == "hat"
  show table "Products 3" export: "/sample/products-3.csv" with
    Products.Product
    Products.Price

Each file has the same format (i.e. the same columns), and the file names are nearly identical, diverging only by a suffix inserted just before the file extension. These files can be read in a single statement as illustrated by:

read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

Product | Price
shirt   | 10.5
pants   | 15
hat     | 5.3

In the above script, the wildcard * (star) character is used to capture any of the three filename variants, which end up consolidated into a single Products table.

The wildcard * matches any character as long as it’s not a /, which is used to separate folders. The file pattern can also include multiple wildcards. This behavior is heavily inspired by the behavior of the ls command on Linux and the dir command on Windows.
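For example, if the files were dispersed across subfolders, a pattern with two wildcards could capture them all at once. The layout below is purely hypothetical and only intended as a minimal sketch of the syntax:

// Hypothetical layout: would match e.g. /sample/2023/products-1.csv and /sample/2024/products-2.csv
read "/sample/*/products-*.csv" as Products with
  Product : text
  Price : number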

The sequence of patterns found in the read statements matters because each file found in the Lokad account can only be matched once by any pattern. In other words, as soon as a file is matched by a read statement, the file is removed from the list of eligible files for later read statements.

This single match behavior is illustrated by the following script:

read "/sample/products-1.csv" as Discarded with
  Product : text
  Price : number

read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

Product | Price
pants   | 15
hat     | 5.3

The shirt line does not appear anymore because its corresponding file has been matched by the Discarded read statement, which precedes the Products read statement.

Pattern unions

Heterogeneous sets of files that cannot be captured by a single file pattern, or that do not follow the same format (more on this below), can still be consolidated into a single table through multiple read statements. The consolidation is achieved through a mechanism referred to as pattern unions.

Let’s leverage the files created in the previous section to illustrate how the Products table can be read from two distinct read statements:

read "/sample/products-1.csv" as Products with
  Product : text
  Price : number

read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product

Which results in the following display:

Product | Price
shirt   | 10.5
pants   | 15
hat     | 5.3

In the above script, the table Products appears twice and, as a result, Envision ends up consolidating all the data originating from the two sets of files into a single table.

However, as all the files follow the exact same format, the declaration of the columns in the second read statement is redundant, and merely mirrors the content of the first read statement. Envision offers a mechanism to avoid this redundancy altogether:

read "/sample/products-1.csv" as Products with
  Product : text
  Price : number

read "/sample/products-*.csv" as Products

show table "Products" a1b3 with
  Products.Product
  Products.Price

The script above is strictly equivalent to the previous one. The second read statement entirely omits the column declarations, which Envision infers from the previous statement.

The two main use cases for resorting to pattern unions are either that files are dispersed among folders that can’t be captured at once with a single file pattern, or that various file formats are involved. The example script provided in this section does not fall into either of those two use cases; its sole point is to illustrate the syntax. The varying format use case will be discussed in the “File Formats” section below.
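As a hedged sketch of the first use case, the pattern union below consolidates files spread across two hypothetical folders, archive and incoming, into the same Products table. The second read statement omits the column declarations, which are inferred from the first one as described above:

// Both folder names are hypothetical; any two patterns targeting the same table work alike.
read "/sample/archive/products-*.csv" as Products with
  Product : text
  Price : number

read "/sample/incoming/products-*.csv" as Products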

Advanced remark: The first use case, the more general filename capture, could also have been addressed by a more general pattern matching syntax such as regular expressions (regex). However, a general matching syntax presents a performance risk whenever millions of filenames are matched against arbitrarily complex rules. A guiding design principle for Envision is predictable performance, and the performance of Envision has to remain fully decoupled from the overall number of files present in the account, as long as those files aren’t actually contributing to the script calculations.

Multi-table reads

Each file can only be captured once by a file pattern. However, in certain cases, the same file needs to contribute to several tables at the same time. Thus, in order to accommodate these situations, Envision provides a multi-table read mechanism.

Let’s revisit the script that generates a short list of variants:

table Variants = with
  [| as Product, as Color |]
  [| "shirt", "white" |]
  [| "shirt", "pink" |]
  [| "pants", "blue" |]
  [| "pants", "black" |]
  [| "hat", "red" |]

show table "Variants" export: "/sample/variants.csv" with
  Variants.Product
  Variants.Color

The file variants.csv can be loaded in two tables with:

read "/sample/variants.csv" as Products with
  Product : text
and as Variants with
  Product : text
  Color : text

show table "Variants" with
  Variants.Product
  Variants.Color

In the above script, the and keyword is used to introduce a continuation of the read statement associated with a second table. This keyword effectively shares the file pattern between the two tables Products and Variants. The and keyword can be repeated in order to share the same file among more than two tables as illustrated by:

read "/sample/variants.csv" as Products with
  Product : text
and as Variants with
  Product : text
  Color : text
and as Colors with
  Color : text

The and keyword can be used to introduce one or more read blocks without file patterns. The original file pattern from the first read block is reused instead. This usage of the keyword and within read statements should not be confused with the Boolean operator that occurs within expressions.

Pattern filters

When a wildcard is used, the read pattern can capture an arbitrarily large number of files. The pattern filters allow you to restrict the capture to a subset of the original set. Usually, this behavior proves useful when the intent is to restrict the read statement to the “last” N files.

Leveraging again the three files created in the previous section, the following script only reads two files out of the three:

read max 2 "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price

Which results in the following display:

Product | Price
pants   | 15
hat     | 5.3

The max 2 option that follows the keyword read indicates that, among all the files captured by the pattern, only the last two, based on lexicographic ordering, are kept.

More generally, the file pattern supports three filters: max, min and latest. Both min and max sort files by name, while latest sorts files by their last update datetime (most recent updates first). The integer that follows the filter can be omitted, the default value being 1.
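For instance, the statement below is a minimal sketch that keeps only the single most recently updated file among those captured by the pattern; the integer after latest is omitted, hence it defaults to 1:

// Keeps only the most recently updated file matched by the pattern.
read latest "/sample/products-*.csv" as Products with
  Product : text
  Price : number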

Combined with the single match behavior, the filters offer the possibility to select a range of files. By capturing files in a first pattern (and typically ignoring them afterward), the initial capture acts as a negative condition on the second pattern.
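As a sketch of this range selection, the script below first captures the last file by name (products-3.csv) into a throwaway Discarded table, so that the second pattern only consolidates the two remaining files:

// products-3.csv (last in lexicographic order) is captured here and thus excluded below.
read max 1 "/sample/products-*.csv" as Discarded with
  Product : text
  Price : number

// Captures products-1.csv and products-2.csv only.
read "/sample/products-*.csv" as Products with
  Product : text
  Price : number

show table "Products" a1b3 with
  Products.Product
  Products.Price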

Roadmap: The pattern filters are somewhat unsatisfying because they don’t really connect to the rest of the Envision syntax. Both patterns and filters aren’t as expressive as we would like them to be, and as a result, the range selection is not a completely straightforward matter. Rather than adding more options, we will probably revisit this bit of syntax entirely in the future.

Pattern expressions

The read statement takes a file pattern - a text argument - as input. So far, all the file patterns that we have used were text literals. However, there are situations where a text expression is preferable. For example, the folder might be the same for many read statements, and it might be desirable to change the target folder without revisiting every single read statement in the script. The pattern expressions are intended precisely for this purpose.

Let’s re-introduce a short script to generate a file:

table Products = with
  [| as Product, as Price |]
  [| "shirt", 10.50 |]
  [| "pants", 15.00 |]
  [| "hat", 5.25 |]

show table "Products" export: "/sample/products.csv" with
  Products.Product
  Products.Price

This file can then be read with the script below:

myFolder = "/sample"

read "\{myFolder}/products.csv" as Products with
  Product : text
  Price : number

show table "Products" with
  Products.Product
  Products.Price

In the above script, the folder is isolated in a scalar text variable named myFolder. This variable is then injected in the file pattern itself through the usual text interpolation mechanism that we have seen in an earlier section.

The variables that contribute to the file patterns have to be compile-time constants. As a rule of thumb, in Envision any basic operation, say a text concatenation, that involves only constant arguments produces a constant result. This mechanism is known as constant propagation.

It is also possible to directly pass a text expression (instead of a text literal) as the file pattern, as done in the following script:

myFolder = "/sample"
myPath = "\{myFolder}/products.csv"

read (myPath) as Products with
  Product : text
  Price : number

show table "Products" with
  Products.Product
  Products.Price

Due to the way Envision parses read statements, the text expression of the file pattern, unless it is a text literal, has to be surrounded by parentheses ( and ).

Roadmap: The file count integer associated with the max, min and latest option has to be a number literal. It does not benefit from any constant propagation mechanism. However, it’s unlikely that this feature will be introduced later on as such. Such a capability is more likely to emerge from the larger decoupling overhaul of the read statements (decoupling patterns from formats).