Reading files

Envision loads and processes tabular files stored in your Lokad account. Furthermore, Lokad provides native support for a series of third-party apps. When your app is supported by Lokad, importing the data and creating the necessary files is a process that is completely managed by Lokad internally. However, when this is not the case, or when Lokad cannot retrieve all the relevant data, it is possible to import files into Lokad manually. This page details how these tabular files should be formatted to be compatible with Envision. The Envision syntax for reading these tabular files is also documented below.

In brief

The file formats supported by Lokad are Excel sheets (.xlsx) and flat text files such as CSV and TSV, possibly compressed with GZip (.gz) or packaged as ZIP archives (.zip).

The ZIP format has a few limitations: .zip files can be read but not written, and a ZIP archive may contain at most one file.

The syntax to read files in Envision is:

read "/foo/myorders.csv.gz" as Orders with
  "My Id" as Id : text
  Quantity : number

// wildcard '*': here, multiple files will be read
read "/foo/promos*.xlsx" as Promos with
  Mechanism : text

// max within the wildcard '*': only one file is read
read max "/foo/stocks*.xlsx" as Stocks with
  OnHand : number

// 'unsafe': tolerates parsing issues
read "/foo/backorders.tsv.gz" unsafe as BO expect [Date] with
  Date : date

A file storage with connectors

Every Lokad account comes with its own file storage service. Simply put, it is possible to create folders, and to upload files into these folders. In this respect, Lokad offers a service similar to other file-hosting services, like Box or Dropbox, except that most file-sharing features are absent. In fact, the intent of Lokad’s file storage system is not to be another file-sharing app, but to provide fully transparent access to all the data used by Lokad when your commerce is being analyzed. Thus, whenever a forecast or a dashboard is generated by Lokad, the corresponding data exists within your account in the form of files that can be downloaded, and possibly analyzed, independently of Lokad.

For smaller businesses, producing files that gather all the relevant historical data of the company can be a very tedious exercise, because such businesses do not always have significant IT resources in the first place. Thus, for many popular commerce apps - Brightpearl, Linnworks, QuickBooks, TradeGecko, Unleashed, Vend … to name a few - Lokad provides built-in connectors. These connectors connect to the app, typically through an online API (Application Programming Interface), and generate, within a Lokad account, files that are directly formatted in a way that makes them most suitable for processing by Envision.

If your business app is not supported (drop us an email if that is the case, as we tend to support the apps that are most frequently requested), or if Lokad’s built-in support does not cover all the relevant data, it is possible to upload files directly into Lokad, as we support manual uploads through the web. However, as data has to be refreshed on a regular basis, it is a lot more practical if the data transfer can take place in a fully automated manner. Therefore, Lokad also supports protocols such as FTP and SFTP that make it possible to automate all file transfers.

Extensive support for tabular files

Lokad supports quite a diverse range of file formats that can contain tabular data. We support popular formats such as Excel sheets or CSV (comma-separated value) files. If you have experience with flat file import/export, you might already know that a myriad of small technical details - encodings, delimiters, date and number formats - can make the process quite tedious.

We have designed our system to be capable of auto-detecting all the technicalities such as encoding, date and number formats, delimiters … While everything happens automatically, it is a fairly non-trivial process that takes place within Lokad when files are read by our system. For example, we now support the auto-detection of more than 100 distinct date formats.

In practice, the guidelines for the tabular file format are simple: the column names should be listed as the first line of the file (this requirement applies to Excel sheets too), and one has to make sure that the token values do not collide with delimiters when using flat text files such as CSV or TSV.
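For instance, a minimal CSV file that follows these guidelines (with illustrative column names) could look like this:

```
Id,Date,Quantity
apple,2019-01-30,5
banana,2019-01-31,2
```

The first line carries the column names, and none of the token values contain the comma delimiter.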

Also, when flat files grow in size, it might become more practical to compress them before uploading them to Lokad. Lokad supports flat text files compressed with GZip as long as the .gz extension is added at the end of the file name. For example, Lokad_Items.tsv.gz is recognized as a TSV file that is compressed with GZip. The same pattern works with the .zip extension for WinZip archives. In the case of WinZip, the archives should only contain a single compressed file; Envision does not support multi-file archives. Excel sheets are already compressed, so there is no need to compress them, even if they happen to be very large.

File and column naming guidelines

Lokad can accommodate pretty much any file name and any column name, but if certain guidelines are followed, the resulting Envision script can be made a lot simpler. The sample dataset is a good example of a set of files that follow these guidelines. If you have not yet had a chance to look at this dataset, we suggest starting here.

The files are expected to be named following the pattern Lokad_TableName.xyz, where TableName is replaced by the name of the table, and xyz is replaced by the actual file extension, for example xlsx for Excel sheets. The table name should not contain any spaces, and it should not begin with a digit either. By following these file naming conventions, Envision is capable of auto-detecting the relevant tables among the files uploaded into Lokad without any further effort.
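Following this convention, a hypothetical account could contain files such as:

```
Lokad_Items.tsv
Lokad_Orders.xlsx
Lokad_PurchaseOrders.csv.gz
```

Each file yields one table, named after the TableName segment of the file name.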

In particular, the Items table is a special case, and for many Envision scripts, there are benefits to structuring your data around such an Items table. However, the Items table is not required, and Envision can run without loading such a table. If a file named Lokad_Items.xyz is provided to Envision, it will be treated as the Items table by default.

If a table happens to be split into multiple files, for example because the individual files are too large, then Envision can consolidate all these files into one table. In order to achieve this, all these files should be named Lokad_TableName_Suffix.xyz, where the Suffix varies from one file to another. For example, if daily extracts are produced for the sales data, then one of the daily sales extracts is likely to be named Lokad_Orders_2015-03-16.tsv. The Suffix plays no specific role beyond keeping the file names distinct, so anything goes here. Naturally, Envision expects to find the same columns across all files varying only by their suffix; otherwise Envision cannot consolidate these files into one table without additional instructions.

The column names come with few expectations from an Envision point of view, although Envision attaches default behaviors to certain well-known columns, such as Id and Date.

Most column names are acceptable as long as they do not contain spaces and do not begin with a digit. Envision also provides ways to override these default behaviors. However, whenever possible, we suggest sticking to the guidelines, as doing so reduces the amount of scripting overhead required within Envision to start getting results.

Syntax to read files

Envision supports a rich syntax for reading files that can handle situations where our naming guidelines cannot be followed (e.g. when the files cannot be modified at the source). The syntax is outlined below. This is not yet the fully general syntax; additional options are detailed further down.

read "/foo/bar*.xyz" as MyTable expect [Id, Date] with
  "Foo1" as Id : text
  "Foo2" as Date : date
  "Foo 3" as Foo3 : text

The /foo/bar*.xyz path may contain a wildcard (*), and this wildcard can be replaced by any sequence of characters. Using a wildcard is optional, but when used, the wildcard offers the possibility of capturing multiple files at once. This pattern is similar to the shell syntax when listing files from the command-line.

The first as keyword, found right after the path, indicates the name of the table. If the table is named MyTable, then the Envision script refers to this table by writing, for example, MyTable.Id. If the table being read is the Items table itself, then the as MyTable clause can be skipped.

The instructions that follow the with keyword act as column renaming instructions. In the example above, the Foo1 column is renamed Id and the Foo2 column is renamed Date. Through renaming these two columns, the MyTable table becomes a legitimate Envision table containing historical data, as both the Id and Date columns are now properly defined.

The third renaming operation takes place as Foo 3 (notice the space in the original column name) is renamed into Foo3. This illustrates how an incorrect column name can be turned into a correct one. It is also important to note that vector variable names do not allow spaces. For the sake of concision, we did not include any further renaming pairs, but as long as the pairs are properly separated on different lines (or delimited by a comma), there is no limit to the number of renaming operations that can take place for any single file. In practice, renaming a column can be useful for making a script more readable, but it can also be used to make the required adjustments when consolidating multiple files with inconsistent column names into one table.
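For example, since renaming pairs may be delimited by commas, the same read statement can be written more compactly on fewer lines (a sketch reusing the hypothetical columns above):

```
read "/foo/bar.xyz" as MyTable with
  "Foo1" as Id : text, "Foo2" as Date : date, "Foo 3" as Foo3 : text
```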

For example, let’s assume that the order history is split between two files. First, we have Lokad_Orders_Old.tsv that contains all the data up to December 31st 2018. This file has three columns named ItemId, OrderDate and Quantity, respectively. This file’s columns do not follow Envision’s column naming guidelines. Second, we have Lokad_Orders.tsv, a more recent file, that contains all the history starting from January 1st 2019. This file contains three columns too, and these columns are named Id, Date and Quantity, which fits with Envision’s default guidelines. The script below illustrates how the two files can be consolidated into a single Orders table.

read "/foo/Lokad_Orders_Old.tsv" as Orders expect [Id, Date] with
  "ItemId" as Id : text
  "OrderDate" as Date : date

read "/foo/Lokad_Orders.tsv" as Orders

Note that the expect table type option can be omitted in the first read statement; in that case, the default type [Id, Date] applies, as explained in the Table type options section. However, when reading two files as one table, the table type option must only be stated once: omit it when reading the second file, as illustrated in the example above.

Another common situation when consolidating several files is a column that is present only in some of them. In this case, Envision allows the missing column to be included in the read statement, by attributing a constant value to it. Imagine, for example, that the new file Lokad_Orders.tsv contains a column SaleChannel which is missing in Lokad_Orders_Old.tsv, or that we need to introduce a flag to distinguish the two files (useful when consolidating an incremental file with the full history). The two files can then be read as follows:

read "/foo/Lokad_Orders_Old.tsv" as Orders expect [Id, Date] with
  "ItemId" as Id : text
  "OrderDate" as Date : date
  const("default") as SaleChannel
  const(false) as IsNewFile

read "/foo/Lokad_Orders.tsv" as Orders with
  const(true) as IsNewFile

In general, the argument of const can be of type text, boolean, number, or date. Just as const(number) and const(text) can be declared, it is possible to declare const(date). To handle a missing date-typed column, declare const(date) in the read options as follows:

read "foo/Lokad_Prices.tsv" as Prices expect [Id] with
  "ItemId" as Id : text
  const(2019,01,30) as Date

In addition, one can read a particular file, e.g. Lokad_Orders.tsv, as two distinct tables, e.g. Orders and NewOrders.

read "/sample/Lokad_Orders.tsv" as Orders expect [Id, Date] with
  Id : text
  Date : date
  Quantity : number
  NetAmount : number
  Client : text
  Currency : text
and as NewOrders expect [Id, Date] with
  Id : text
  Date : date
  "Client" as CustomerID : text
  "Quantity" as QtyOrdered : number

Naturally, it is possible to define as many read statements as it takes to properly cover all the files and all the tables. Envision does not impose a precise location for these statements within your script, so technically they could be placed anywhere, even at the very bottom of the script. However, we advise keeping these statements at the beginning of the script, because this is where people familiar with programming languages would expect to find them.

Denormalize multi-value cells

Sometimes, input files adopt a “compact” formatting - typically to make them more human-readable - where a column ends up cramming many values into every cell. From an Envision perspective, when facing such a situation, we typically want to denormalize the data, i.e. to turn each line into many identical lines varying only by the cell that originally contained multiple values.

Parsing catch: text values are limited to 256 characters in Envision. However, multi-value cells found within flat text files (e.g. CSV files) can exceed this 256-character limit, as long as each individual value remains under 256 characters.

The split option can be used to perform this denormalization. Let’s consider an input file named foo.tsv (3 lines plus the header) where the column MyList contains multiple comma-separated values:

    MyRef MyList
    1     A,B,C
    2
    3     D

This file can be read with the following syntax:

read "foo.tsv" as T with
  MyRef : text
  "MyList" split:"," as MyElement : text

show table "My denormalized list" with T.MyRef, T.MyElement

The content of the file is read as a 4-line table:

    MyRef MyElement
    1     A
    1     B
    1     C
    3     D

Notice that "A,B,C" becomes ["A","B","C"] and "D" becomes ["D"]. Whitespace at both ends of each resulting value is trimmed, and empty values are dropped, e.g. "A ,, B" becomes ["A","B"] and "" becomes []. For each value in the array, a line is emitted in the output table. For the other (non-split) columns, the values of the cells on the original line are repeated.

Single column limit: No more than a single column can be decorated with the split option. This limitation is intended to avoid accidental cardinality explosion caused by the denormalization process itself. However, in practice, if you need to denormalize several columns, you can read the same file more than once in the same script, with different split options. For instance, given a file “Lokad_Items.tsv” with columns “Name” (text), “Suppliers” (splittable text) and “Stores” (splittable text), you can read:

read "/foo/Lokad_Items.tsv" as Items[id] with
  Id : text
  Name : text
and as ItemSuppliers expect [id] with
  Id : text
  "Suppliers" split: "," as Supplier : text
and as ItemStores expect [id] with
  Id : text
  "Stores" split: "," as Store : text

Wildcard filters

The wildcard (*) offers the possibility of selecting multiple files at once, and reading them as a single table. When included in the read file path, the wildcard can replace any string that does not contain the special character /: it may thus be used in the file name as well as in a folder name, and may cover the file extension. The wildcard also matches the empty string. Finally, multiple wildcards within the same path are supported. As an example,

read "/foo/Lokad_Orders*.gz" as Orders with

could match “/foo/Lokad_Orders-2018-07-25.csv.gz” and “/foo/Lokad_Orders.gz”.

Typically, the wildcard is used to read a set of historical data stored in several files (and folders) containing a date in their names. However, sometimes the intent is to read only a single file among them, for example the most recent one. This can be done by applying a filter on the set of files matching the wildcard, as follows:

read max "/foo/Lokad_Orders*.gz" as Orders with

There are 3 wildcard filters available:

In addition, Envision supports reading a specified number of files. This can be expressed by adding a literal positive integer just after the wildcard filter (e.g. max 3).
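For instance, assuming daily order extracts whose names embed their dates (so that the lexicographic order of the file names matches their chronological order), the following sketch reads only the 3 most recent extracts:

```
// 'max 3': read the 3 highest-ranked files matching the wildcard
read max 3 "/foo/Lokad_Orders*.gz" as Orders with
  Id : text
  Date : date
  Quantity : number
```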

Defining the type expectations

Unless specified otherwise, Envision expects any table besides the Items table to contain both an Id and a Date column, which can be obtained through renaming the relevant columns, as detailed in the previous section. However, sometimes a table might not exactly correspond to these expectations, and the Envision syntax can be used to clarify the expectations associated with a table. Similarly, the type of a column is typically inferred from the Envision script itself, but sometimes type inference alone is not sufficient. Again, the Envision syntax offers the possibility to specify any column type that may be necessary.

Envision offers the possibility to define the primary keys of a table by listing them between brackets right after the table name. Only the following combinations are allowed:

read "a.csv" as A // no expect
read "b.csv" as B expect [Id]
read "c.csv" as C expect [Date]
read "d.csv" as D expect [Id, Date]

Case A is the default behavior: it defines a table with no constraints at all. It accommodates diverse scenarios, but when there is no Id column, you will be missing some of the niceties of the Envision language.

Case B, expect [Id], defines a table with any number of lines per item. For example, a table representing ranvars falls into this category.

Case C, expect [Date], defines a table that contains any number of scalar values for certain days, a given day being possibly associated with multiple values.

Case D, expect [Id, Date], defines a table that contains both items and dates. In practice, most historical data, such as the sales order history, falls into this case.

Type constraints on table columns

Envision is a strongly typed language: every vector is associated with one of the types available in Envision. There are only 4 types available in Envision: text, number, date and boolean. Input types must be specified in the read statement, e.g. as follows:

read "a.csv" as A with
  "Foo" as X : text
  Y : number
  Z : date

Then, going back to the sample dataset, if we write:

read "/sample/Lokad_Items.tsv" as Items[Id] with
  Id : text
read "/sample/Lokad_Orders.tsv" as Orders expect [Id, Date] with
  Id : text
  Date : date
  Quantity : number
  Client : text
read "/sample/Lokad_PurchaseOrders.tsv" as PO expect [Id, Date] with
  Id : text
  Date : date

Quantity = sum(Orders.Quantity)

then the vector Orders.Quantity is typed as a number, because only numbers are eligible to be summed. As a result, when Envision parses the Lokad_Orders.tsv file, it expects the column Quantity to contain tokens that can be successfully parsed as numbers. If we tried to write:

Items.Nonsense = sum(Orders.Client)

Envision would try to parse the Client column of the Lokad_Orders.tsv file as numbers, and the process would fail because this column contains client identifiers that are not numbers.

Envision provides a syntax to explicitly specify the type to be expected for every column of every table. For example, the sample dataset can be explicitly typed using the following statements:

expect Items.Supplier : text
expect Orders.NetAmount : number
expect PO.ArrivalDate : date

The syntax simply goes as follows: expect MyTable.MyColumn : type, where type is one of the four eligible types: text, number, date or boolean. The most widely used type assertion is the date type, because based on arithmetic operations and type inference alone, it is not always possible to correctly infer that a column is expected to be a date.

Parsing options

Envision’s file parser is very tolerant; however, there are some situations where the parser needs a bit of help. The skip option can be used to tell the parser to skip the first N lines of the flat file. The syntax is as follows:

read "/foo/bar*.csv" skip:2 as MyTable with
  "Foo1" as Id : text
  "Foo2" as Date : date

In the example above, the parser skips the first two lines of the flat file and expects the column headers to be found on the third line. The skip option is optional, with skip:0 being the default behavior. This option is intended to cope with systems which introduce meta-data at the very beginning of their flat file extractions.

Unsafe read

By default, Envision’s file parser is strict: if a value is expected to be a date or a number, and this value cannot be successfully parsed as such, the read statement fails with an error. As a rule of thumb, when Envision encounters parsing errors, the best option consists of investigating why the file is corrupted in the first place. Indeed, as Envision is capable of recognizing many formats of dates and numbers, if Envision fails, the data is likely to be corrupted. Data corruption can lead to all sorts of errors, and when Envision fails, more often than not, this is merely the visible sign of a problem that happened earlier.

However, with large datasets, minor data corruptions become much more difficult to avoid. In particular, when corrupted data also happens to be old, fixing the data corruption might not even be worth the effort. Consequently, for these specific situations, Envision supports an unsafe read mode, as illustrated by the syntax below:

read "/foo/orders.tsv" unsafe as Orders expect [Id, Date] with
  "My Id" as Id : text
  "My Date" as Date : date

When unsafe is used, Envision treats the parsing issues as warnings rather than errors. This option allows calculations to go ahead even if values or lines have been discarded by the parser because they could not be read.