Envision loads and processes tabular files that are stored in your Lokad account. Furthermore, Lokad provides native support for a series of third party apps. When your app is supported by Lokad, importing data and creating the necessary files is a process that is completely managed by us internally. However, when this is not the case, or when Lokad cannot retrieve all the relevant data, it is possible to import files into Lokad manually. This page details how these tabular files should be formatted to be compatible with Envision. The Envision syntax relative to how these tabular files are read by the system is also documented below.
The file formats supported by Lokad are:
- CSV , TSV and other similar usual flat text files,
- Microsoft Excel 2007+ files,
- GZip archives,
.gzfiles; content is expected to be a flat file.
- WinZip archives ,
.zipfiles; content is expected to be a flat file, archive should contain exactly 1 file.
- 7z archives,
.7zfiles; content is expected to be a flat file, archive should contain exactly 1 file.
The syntax to read files in Envision is:
read "/foo/myorders.csv.gz" as Orders with "My Id" as Id Quantity : number // wildcard '*': here, multiple files will be read read "/foo/promos*.xlsx" as Promos[*] // max within the wildcard '*': only one file is read read max "/foo/stocks*.xlsx" as Stocks[*] // 'unsafe': tolerates parsing issues read "/foo/backorders.tsv.zip" unsafe as BO[Date, *]
A file storage with connectors
Every Lokad account comes with its own file storage service. Simply put, it is possible to create folders, and to upload files into these folders. In this respect, Lokad offers a service similar to other file-hosting services, like Box or Dropbox, except that most file-sharing features are absent. In fact, the intent of Lokad’s file storage system is not to be another file-sharing app, but to provide a fully transparent way of having access to all the data used by Lokad whenever your commerce is being analyzed. Thus, whenever a forecast or a dashboard is generated by Lokad, the corresponding data exists within your account in the form of files that can be downloaded, and possibly analyzed, independently of Lokad.
For smaller businesses, producing files that gather all the relevant historical data of the company can be a very tedious exercise, because such businesses do not always have significant IT resources in the first place. Thus, for many popular commerce apps - Brightpearl, Linnworks, QuickBooks, TradeGecko, Unleashed, Vend … to name a few - Lokad provides built-in connectors. These connectors can connect to the app, typically using an online API (Application Programming Interface), and generate, within a Lokad account, files that are directly formatted in a way that make them most suitable to be processed by Envision.
If your business app is not supported (if that is the case drop us an email, as we tend to support the apps that are most frequently requested), or if Lokad’s built-in support does not cover all the relevant data, in this case it is possible to upload files directly into Lokad as we support manual uploads through the web. However, as data will have to be refreshed on a regular basis, it is a lot more practical if the data transfer can take place in a fully automated manner. Therefore, Lokad also supports protocols such as FTP and SFTP that offer the possibility to automate all file transfers.
Extensive support for tabular files
Lokad supports quite a diverse range of file formats that can contain tabular data . We support popular formats such as Excel sheets or CSV (comma-separated value) files. If you have experience with flat file import/export, you might already know that there are a myriad of small technical details that can make the process quite tedious:
- Encoding of the file might differ from the encoding expected by the app.
- Dates or numbers might not be formatted as expected.
- Column delimiter or the line return encoding might also differ.
We have designed our system to be capable of auto-detecting all the technicalities such as encoding, date and number formats, delimiters … While everything happens automatically, it is a fairly non-trivial process that takes place within Lokad when files are read by our system. For example, we are now supporting the auto-detection of more than 100 distinct date formats.
In practice, the guidelines for the tabular file format are simple: the column names should be listed as the first line of the file (this requirement applies to Excel sheets too), and one has to make sure that the token values do not collide with delimiters when using flat text files such as CSV or TSV.
Also, when flat files grow in size, it might become more practical to compress them before uploading them to Lokad. Lokad supports flat text files compressed with GZip as long as the
.gz extension is added at the end of the file name. For example,
Lokad_Items.tsv.gz is recognized as a TSV file that is compressed with GZip. The same pattern works with the
.zip extension for WinZip archives, and with
.7z extension for 7z archives. In the case of WinZip and 7z, the archives should only contain a single compressed file; Envision does not support multi-file archives. Excel sheets are already compressed, so there is no need to compress them, even if they happen to be very large.
File and column naming guidelines
Lokad can accommodate pretty much any file name and any column name, but if certain guidelines are followed, the resulting Envision script can be made a lot simpler. The sample dataset is a good example of a set of files that follow these guidelines. If you have not yet had a chance to look at this dataset until now, we suggest to start here.
The files are expected to be named following this pattern: in
TableName is replaced by the name of the table, and
xyz is replaced by the actual file extension, for example
xlsx for Excel sheets. The table name should not contain any spaces, and it should not begin with a digit either. By following these file naming conventions, Envision is capable of auto-detecting the relevant tables to be uploaded into Lokad without any further effort.
In particular, the items table is a special case, and for many Envision scripts, there are benefits to structuring your data around such an items table. However, the items table is not required, and Envision can run without loading such a table. If a file named
Lokad_Items.xyz is provided to Envision, it will be treated as the items table by default.
If a table happens to be split into multiple files, for example because the individual files are too large, then, Envision can consolidate all these files into one table. In order to achieve this, all these files should be named
Lokad_TableName_Suffix.xyz where the
Suffix varies from one file to another. For example, if daily extracts are produced for sales extracts, then one of the daily sales extracts is likely to be named
Suffix plays no specific role except for keeping the files distinct, so anything goes here. Naturally, Envision expects to find the same columns across all files varying only by their suffix, otherwise Envision cannot consolidate these files into one table without additional instructions.
The column names come with little expectations from an Envision point of view. Envision also provides some default behaviors:
- When a column named
Idis found in a flat file, by default, this column is treated as a foreign key to the
Idcolumn originally found in the
- When a column is named with a name that ends with
Date, by default, this column is treated as a date by Envision.
Most column names are acceptable as long as they do not contain spaces and do not begin with a digit. Envision also provides ways to override the behaviors listed above. However, whenever possible, we suggest sticking to the guidelines, as it will reduce the amount of scripting overhead required within Envision to start getting results.
Syntax to read files
Envision supports a rich syntax for reading files that can handle situations when our naming guidelines cannot be followed (e.g. when the files cannot be internally modified). The syntax is outlined below. It is not yet the fully general syntax; some additional options are detailed below.
read "/foo/bar*.xyz" as MyTable with "Foo1" as Id "Foo2" as Date "Foo 3" as Foo3
/foo/bar*.xyz path may contain a wildcard (*), and this wildcard can be replaced by any sequence of characters. Using a wildcard is optional, but when used, the wildcard offers the possibility of capturing multiple files at once. This pattern is similar to the shell syntax when listing files from the command-line.
as keyword, found right after the path, indicates the name of the table. If the table is named
MyTable, then the Envision script will refer to this table by writing
MyTable.Id for example. If the table to be uploaded is the items table itself, then this
as MyTable can be skipped.
The instructions that come afterwards in the form of the
with keyword act as column renaming instructions. In the example above, the
Foo1 column is renamed as
Id and the
Foo2 column is renamed as
Date. Through renaming these two columns, the
MyTable table becomes a legitimate Envision table containing historical data as both the
Date columns are now properly defined.
The third renaming operation takes place as
Foo 3 (notice the space in the original column name) is renamed into
Foo3. This illustrates how an incorrect column name can be turned into a correct one. It is also important to note that vector variable names do not allow for spaces. For the sake of concision, we did not include any further renamed pairings, but actually, as long as the pairings are properly separated on different lines (or delimited by a comma), there is no limit to the number of renaming operations that can take place for any single file. In practice, renaming a column can be useful for making a script more readable, but it can also be used to make the required adjustments when it is necessary to consolidate multiple files into one table, while these files have inconsistent column names.
For example, let’s assume that the order history is split between two files. First, we have
Lokad_Orders_Old.tsv that contains all the data up to December 31st 2014. This file has three columns named
Quantity respectively. This file’s columns do not follow Envision’s column naming guidelines. Second, we have
Lokad_Orders.tsv, a more recent file, that contains all the history starting from January 1st 2015. This file contains three columns too, and these columns are named
Quantity, which fits with Envision’s default guidelines. The script below illustrates how the two files can be consolidated into a single
read "/foo/Lokad_Orders_Old.tsv" as Orders with "ItemId" as "Id" "OrderDate" as Date read "/foo/Lokad_Orders.tsv" as Orders
Another common configuration that can be faced while consolidating several files is a column that is present only in some of them. In this case, Envision allows to include the missing column in the read statement, by attributing a constant value. Let us imagine, for example, that the new file
Lokad_Orders.tsv contains a column
SaleChannel, which is missing in
Lokad_Orders_Old.tsv, or that we need to introduce a flag to distinguish the two files (useful when consolidating an incremental file with the full history). The two files can then be read as follows
read "/foo/Lokad_Orders_Old.tsv" as Orders with "ItemId" as "Id" "OrderDate" as Date const("default") as SaleChannel const(false) as IsNewFile read "/foo/Lokad_Orders.tsv" as Orders with const(true) as IsNewFile
where, in general, the argument of
const could be of type text, Boolean or number.
Naturally, it is possible to define as many
read statements as it takes to properly cover all the files and all the tables. Envision does not impose a precise location for these statements within your script, so technically they could be placed anywhere, even at the very bottom of the script. However, we advise to keep these statements at the beginning of the scripts because this is where people familiar with programming languages would expect to find them.
The wildcard (
*) offers the possibility of selecting multiple files at once, and reading them as a single table. When included in the
read file path, the wildcard can replace any string that does not contain the special character
/: it may thus be used in the file name as well as in a folder name, and may include the file format. An empty string would also match the wildcard. Finally, multiple wildcards within the same path are supported. As an example,
read "/foo/Lokad_Orders*.gz" as Orders
could match “/foo/Lokad_Orders-2018-07-25.csv.gz” and “/foo/Lokad_Orders.gz”.
Typically, the wildcard is used to read a set of historical data stored in several files (and folders), containing a date in their names. However, sometimes, the intent is to read only a single file out of them, for example the most recent one. This can be done by applying a filter on the set of file matching the wildcard, as follows:
read max "/foo/Lokad_Orders*.gz" as Orders
There are 3 wildcard filters available:
min: the first file when ordering files based on their respective names (lexicographic ordering).
max: the last file when ordering files based on their respective names (lexicographic ordering).
latest: the last file when ordering files based on their last modified date.
In addition, Envision supports the possibility of reading a specified number of files. This can be expressed by adding a literal positive integer just after the wildcard filter (e.g.,
Defining the type expectations
Unless specified otherwise, Envision expects any table, besides the
Items table, to contain both an
Id and a
Date column, which can be obtained through renaming the relevant columns, as detailed in the previous section. However, sometimes, a table might not exactly correspond to these expectations, and the Envision syntax can be used to clarify any expectations associated with a table. Similarly, the type of column is typically inferred from the Envision script itself, but sometimes, type inference alone is not sufficient. Again, the Envision syntax offers the possibility to specify any column type that may be necessary.
Envision offers the possibility to define the primary keys for a table, by listing them between brackets right after the definition. There are only 7 primary key combinations that are allowed:
read "a.csv" as A[*] read "b.csv" as B[Id] read "c.csv" as C[Id, *] read "d.csv" as D[Date] read "e.csv" as E[Date, *] read "f.csv" as F[Id, Date] read "g.csv" as G[Id, Date, *]
A[*] defines a table with no constraints at all. It’s a wildcard and it accommodates very diverse scenarios, but when there is no
Id column, you will be missing some of the niceties of the Envision language.
B[Id] defines a table with one line per item at most. For example, our inventory settings fall into this category.
C[Id, *] defines a table with any number of lines per item. For example, a table representing a distribution of probabilities falls into this category.
D[Date] defines a table that contains a single scalar value for certain days. For example, such a table could be used to list the national holidays applicable to a certain geographical location.
E[Date, *] defines a table that contains any number of scalar values for certain days. It is an extension of the previous case, where a given day can be associated with multiple factors.
F[Id, Date] defines a table that contains at most one value for each
[Id, Date] pairing. For example, such a table could be used to list the past stock-out occurrences.
G[Id, Date, *] is the default behavior, the one you get when omitting brackets. In practice, most historical data falls into this case, such as the sales order history for example.
Type constraints on table columns
Envision is a strongly typed language. It means that all vectors are associated with one of the types available in Envision. There are only 4 types available in Envision:
boolean, and these types are typically inferred directly from the script itself. However, it is possible to specify the type expectations with:
read "a.csv" as A[*] with "Foo" as X : text Y : number Z : date
Then, going back to the sample dataset, if we write:
// these three lines are standard // and may be from now on omitted read "/sample/Lokad_Items.tsv" read "/sample/Lokad_Orders.tsv" as Orders read "/sample/Lokad_PurchaseOrders.tsv" as PO Quantity = sum(Orders.Quantity)
then, the vector
Orders.Quantity is implicitly typed as a number because only numbers are eligible to be summed. As a result, this means that when Envision parses the
Lokad_Orders.tsv file, it expects the column
Quantity to contain tokens that can be successfully parsed as numbers. If we were to try writing:
Nonsense = sum(Orders.Client)
In this case, Envision would try to parse the
Client column of the
Lokad_Orders.tsv file with numbers, and the process would fail, because this column contains client identifiers that are not numbers. In addition to the type inference mechanism, Envision also provides a syntax to explicitly specify the type to be expected for every column of every table. For example, the sample dataset can be explicitly typed using the following statements:
expect Supplier : text expect Orders.NetAmount : number expect PO.ArrivalDate : date
The syntax simply goes as follows -
expect MyTable.MyColumn : type, where
type is one of the four eligible types:
boolean. The most widely used type assertion is the
date type, because based on arithmetic operations and typing inference alone, it is not always possible to correctly infer that a column is expected to be a date.
The Envision’s file parser is very tolerant, however, there are some situations where the parser needs a bit of help. The
skip option can be used to tell parser to skip the N first lines of the flat file. The syntax is as follows:
read "/foo/bar*.csv" skip:2 as MyTable with "Foo1" as Id "Foo2" as Date
In the example above, the parser skips the first two lines of the flat file and expects the column headers to be found on the third line. The
skip option is optional, with
skip:0 being the default behavior. This option is intended to cope with systems which introduce meta-data at the very beginning of their flat file extractions.
By default, Envision’s file parser is strict: if a value is expected to be a date or a number, and if this value cannot be successfully parsed as such, the
read statement will fail with an error. As a rule of thumb, when Envision encounters parsing errors, the best option consists of investigating why the file is corrupted in the first place. In fact, as Envision is capable of recognizing many formats of dates and numbers, the chances are that if Envision fails, data is likely to be corrupted. Data corruption can lead to all sort of errors. When Envision fails, more often than not, this is merely the visible sign of a problem that happened earlier.
However, with large datasets, minor data corruptions become much more difficult to avoid. In particular, when corrupted data also happens to be old, fixing the data corruption might not even be worth the effort. Consequently, for these specific situations, Envision supports an
unsafe read mode, as illustrated by the syntax below:
read "/foo/orders.tsv" unsafe with "My Id" as Id "My Date" as Date
unsafe is used, Envision treats the parsing issues as warnings rather than errors. This option allows calculations to go ahead even if values or lines have been discarded by the parser because they could not be read.