September 1, 2017

Data

Overview

  • 10 unique data sources.
  • Data from 6 exchanges, 1 depository, and a microfinance institution.
  • 2 proprietary databases: CMIE ProwessDX, Reuters terminal.
  • Public data from websites.
  • Data use contracts with 5 institutions.
  • Estimated size: 50+ TB
  • Organised on the filesystem in around 6.5 million files.

Data sources

Data from NSE

Features

  • Time period: 17 years (1999 - 2015)
  • Frequency of data flow: daily (EOD)
  • Frequency of transactions: tick-by-tick
  • Markets: spot, futures & options on spot, and currency derivatives
  • Trades and Orders (TAO)
  • Market By Price (MBP)
  • Limit Order Book (LOB) snapshots
  • Cash index data

Data types

  • Trades and Orders (TAO)
      – Contains all transactions that occurred at NSE.
      – Frequency of transactions: tick-by-tick
      – Transactions/second: ~60,000
      – Transactions/day: ~700 million
  • Market By Price (MBP)
      – Contains order book snapshots for all stocks.
      – Frequency of transactions: each second
      – Transactions/day: ~45 million

Data types

  • Limit Order Book (LOB)
      – Contains order book snapshots for all stocks.
      – Frequency of transactions: hourly
      – Records/day: ~10 million

Data from MCX

Features

  • Time period: 5+ years
  • Frequency of data flow: daily (EOD)
  • Frequency of transactions: tick-by-tick
  • Market: commodity futures
  • Trades and Orders (TAO)
  • Market By Price (MBP)
  • Bhavcopy

Data types

  • Trades and Orders
      – Contains all transactions that occurred at MCX.
      – Frequency: tick-by-tick
      – Transactions/day: ~11 million
      – Span: 2013 – 2017
  • MBP
      – Limit order book of commodity futures contracts, 5 levels deep.
      – Frequency: sub-second
      – Span: 2012 – 2016

Bhavcopy

  • End of day position of all contracts.
  • Frequency: daily
  • Span: 2003 – 2017

Data from NCDEX

Features

  • Time period: 3+ years
  • Frequency of data flow: multiple
  • Frequency of transactions: tick-by-tick
  • Market: commodity futures
  • Trades and Orders data
  • MBP data
  • Bhavcopy

Data types

  • CASH
      – Spot prices of commodities.
      – Data includes time of polling and location.
      – Span: 2012 – 2015
      – Frequency: twice/day
  • MBP
      – Commodity futures contracts.
      – Contains limit order book, 5 levels deep.
      – Span: 2014 – 2016

Data types

  • Bid Ask data
      – Contains the top bid and ask values of commodity futures contracts.
      – Contains OHLC values.
      – Not validated.
      – Span: 2012 – 2015
      – Frequency: every 3-4 seconds

Data types

  • Bid Ask live feed data
      – Contains bid and ask values of commodity futures contracts.
      – Span: 2016 – 2017
      – Frequency: per minute
  • Bhavcopy
      – End of day position of all contracts.
      – Frequency: daily
      – Span: 2006 – 2017

Data from other exchanges

Sources

  • Thomson Reuters Eikon
  • DGCX (trades and quotes): INR/USD futures
  • SGX (trades and quotes): Nifty futures
  • BSE public data
  • NSE public data

Data from NSDL

Features

  • Span: 2007 – 2017
  • Frequency: daily
  • Off-site data

Holdings and transactions data

  • Contains information for each unique client, including location and category.
  • Holdings data contains the holding quantity of each client for a unique ISIN.
  • Transactions data contains unique source and client IDs for every ISIN.
  • No direct access to the database.

Firm level database (FLD)

Features

  • Full dumps of CMIE Prowess and CMIE Capex databases.
  • Contains more than 40,000 firms.
  • Has 107 tables.
  • Data frequency: daily, monthly, quarterly, annual
  • Database: MySQL
  • Size: 12GB
  • Uses SQL to access data.

Macro economy data (tsdb)

Features

  • Macro-economic time series data
  • 20,000+ series
  • 23 sources
  • 200+ countries
  • Data span: 15-20 years

Software

Database

  • Flat file system.
  • No binary database system.
  • Can scale to any number of machines for computation.
  • Use of open source software.
  • Efficient and low cost.
  • A metadata database to manage data.
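The metadata layer over the flat filesystem can be pictured as a lookup table from (exchange, segment, date) to a file path. A minimal sketch, in which the table, the paths, and the `lookup` helper are all hypothetical illustrations, not the actual schema:

```r
# Hypothetical sketch of a metadata table for a flat-file data store:
# each row maps an (exchange, segment, date) key to a file on disk.
meta <- data.frame(
  exchange = c("NSE", "NSE", "MCX"),
  segment  = c("CASH", "FAO", "FAO"),
  date     = as.Date(rep("2014-05-23", 3)),
  path     = c("nse/cash/20140523.gz",   # illustrative paths only
               "nse/fao/20140523.gz",
               "mcx/fao/20140523.gz"),
  stringsAsFactors = FALSE
)

# Resolve a key to the file(s) holding that day's data.
lookup <- function(meta, exchange, segment, date) {
  meta$path[meta$exchange == exchange &
            meta$segment == segment &
            meta$date == date]
}
```

Because the files themselves are plain flat files, any machine holding a copy of this table can locate and process data independently, which is what lets the setup scale across machines.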

Application Programming Interface (API)

  • IntraDay DataBase (iddb)
      – Unified interface to fetch any data type or source.
      – Can run programs to derive data.
      – Designed to scale in terms of adding features.
      – Can handle parallelism.
      – Packaged in R package format to maintain discipline.

eventstudies

  • An R package to conduct standard event study analysis using not only daily return data but also intra-day returns data.
  • A variable event window can be defined, and the estimation period is set accordingly.
  • Models for estimation of abnormal returns:
      – Market model
      – Excess returns model
      – Constant mean returns model
      – Augmented market model
  • Inference strategies provided:
      – Bootstrap
      – Wilcoxon signed-rank
      – Classic (t-test)
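Two of the abnormal-return models listed above can be sketched in a few lines of R. The function names here are illustrative helpers, not the eventstudies package API:

```r
# Constant mean returns model: the abnormal return is the event-window
# return minus the mean return over the estimation window.
ar_constant_mean <- function(r_est, r_evt) {
  r_evt - mean(r_est)
}

# Market model: regress firm returns on market returns over the
# estimation window, then subtract the fitted return at the event.
ar_market_model <- function(r_est, rm_est, r_evt, rm_evt) {
  fit <- lm(r_est ~ rm_est)
  r_evt - (coef(fit)[1] + coef(fit)[2] * rm_evt)
}
```

With noise-free returns that are exactly linear in the market return, the market-model abnormal return is zero, which is a quick sanity check on the regression step.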

ifrogs

A package with miscellaneous functions relating to the computation of various financial measures. It currently has three modules:

  1. DtD: Computation of distance to default.
  2. pdshare: Information share.
  3. vix_ci: Computation of VIX.
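As an illustration of the DtD module, here is a minimal closed-form Merton distance to default. It assumes asset value and volatility are already known; in practice they must be backed out from equity data, which is the harder part the package handles. `dtd_merton` is a hypothetical name, not the ifrogs API:

```r
# Simplified Merton distance to default:
#   DtD = (log(V / D) + (mu - 0.5 * sigma^2) * h) / (sigma * sqrt(h))
# where V is asset value, D is the face value of debt, mu the asset
# drift, sigma the asset volatility, and h the horizon in years.
dtd_merton <- function(V, D, mu, sigma, horizon = 1) {
  (log(V / D) + (mu - 0.5 * sigma^2) * horizon) / (sigma * sqrt(horizon))
}

# A firm with assets 1.5x its debt, 5% drift, 25% asset volatility:
dtd_merton(V = 150, D = 100, mu = 0.05, sigma = 0.25)
```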

Data processes

Daily process

  • Daily download
  • Validation of fields
  • Backup on another server
  • Split files into stocks
  • Backup of the split files

Derived measures

  • Impact cost
  • Order matching engine
  • Algorithmic trading intensity
  • High frequency trading characteristics
  • Tracking of orders
  • Basis computation
  • Active and passive orders identification
  • Implied volatility
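As a sketch of the first measure: impact cost for a buy order of size q walking up the ask side of a 5-deep book, measured against the bid-ask midpoint. This is a simplified one-sided illustration with hypothetical names; the exchange definition averages the buy and sell sides over a standard order size:

```r
# Impact cost (%) of buying q units against an ask-side book.
# asks: data.frame with columns price and qty, best ask first.
impact_cost <- function(asks, best_bid, q) {
  mid <- (best_bid + asks$price[1]) / 2          # ideal (midpoint) price
  filled_before <- c(0, cumsum(asks$qty))[seq_len(nrow(asks))]
  take <- pmin(asks$qty, pmax(q - filled_before, 0))
  stopifnot(sum(take) == q)                      # book must be deep enough
  avg <- sum(take * asks$price) / q              # achieved average price
  100 * (avg - mid) / mid
}

asks <- data.frame(price = c(100.5, 101, 101.5), qty = c(100, 100, 100))
impact_cost(asks, best_bid = 100, q = 150)
```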

Order matching

  • Order entry, modification, and cancellation are \(O(1)\) time complexity.
  • Locating a price level is also \(O(1)\).
  • Producing depth is \(O(n^{2})\).
  • The limit order book is a pre-allocated array of pointers indexed by price.
  • At each price, orders are sorted by time.
  • Two operations remain linear, \(O(n)\):
      – Searching for a specific order within a price level.
      – Finding the next best price.
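The data structure described above can be sketched compactly: a pre-allocated array indexed by integer price ticks, holding arrival-ordered orders at each level. This is a toy R sketch with invented names (the production engine is written in C):

```r
# Pre-allocate one slot per price tick; empty slots are NULL.
new_book <- function(max_tick) vector("list", max_tick)

# The price tick is used directly as the array index, so reaching the
# right price level is O(1); orders at a level keep arrival (time) order.
add_order <- function(book, price_tick, order_id) {
  book[[price_tick]] <- c(book[[price_tick]], order_id)  # FIFO append
  book
}

# Finding the next best price is one of the O(n) linear scans noted
# above: walk the tick array until a non-empty level is found.
best_ask <- function(book) {
  for (p in seq_along(book)) if (length(book[[p]])) return(p)
  NA_integer_
}
```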

Order matching

The process

  1. Download updated code repository.
  2. Split compressed data using grep.
  3. Prepare data to use with matching engine.
  4. Run pre-open program.
  5. Run matching engine.
  6. Output trades, depth, and log files.
  7. Calculate error percentages.

Performance

  • Stock: SBIN/INFY
  • Depth: 10
  • Segment: Cash
  Year   Number of orders   Time (core)   Time (overall)
  2009             65,083         18.74               74
  2010          1,271,870         48.96              140
  2011          3,013,716        105.00              276
  2012          3,248,293        129.45              301
  2013          5,188,148        178.40              424
  2014          4,113,407        142.75              386

Performance graph (Cash)

Performance

  • Stock: SBIN
  • Depth: 10
  • Segment: Futures & options
  Year   Number of orders   Time (core)   Time (overall)
  2009            722,517         70.29              124
  2010          2,506,027        122.29              426
  2011          4,945,747        311.29              519
  2012         10,401,516        517.26              780
  2013         14,288,199        716.56             1020
  2014         16,591,150       1115.02             1620

Performance graph (FAO)

Performance: AT Intensity

Performance: Impact Cost

Demo 1: Computing AT intensity

d <- as.Date("2014-05-23")
system.time(
    atintensity <- iddb(dates = d,
                        symbol = "INFY",
                        segment = "CASH",
                        measures = "atintensity",
                        type = "value",
                        jiffylevel = FALSE)
)

Demo 2: Volatility computation using MBP data

system.time(
    mbp <- iddb(dates = d,
                symbol = "INFY",
                segment = "CASH",
                measures = "mbp",
                fields = "ltp")
)
vol <- function(x) {
    x <- xts::to.minutes(x, OHLC = FALSE)
    vol <- sd(x / 100)
    return(vol)
}

Demo 3: MBP data from NCDEX

system.time(
    comm <- iddb(dates = d,
                 symbol = "CASTORSEED",
                 segment = "FAO",
                 measures = "mbp",
                 exchange = "NCDEX",
                 fields = "ltp")
)
vol(comm[[1]])

Demo 4: Generating data on the fly

d <- as.Date("2014-05-27")
mbp <- iddb(dates = d,
            symbol = "3IINFOTECH",
            segment = "CASH",
            measures = "mbp",
            fields = c("bbp", "bsp"),
            depth = 5,
            generate = TRUE,
            wait = TRUE,
            showConsole = TRUE)

Parallel computing

Process

  • Distribute workload across nodes.
  • Identify tasks to run on different machines.
  • Split the tasks and execute.
  • Fetch or store the results.

Strategy

  • Processing on daily files.
  • No dependence on previous day's results.
  • Embarrassingly parallel tasks.
  • Scalable to N CPUs.
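The strategy above maps directly onto R's parallel package: one independent task per daily file, fanned out over N CPUs with no shared state. A minimal sketch, where `process_day` is a hypothetical stand-in for the real per-day job:

```r
library(parallel)

# Each day's file is processed independently, so mclapply can fork the
# work across cores (mc.cores = N). Placeholder work: the length of the
# formatted date string.
process_day <- function(day) {
  nchar(format(day))
}

days <- seq(as.Date("2014-05-22"), as.Date("2014-05-28"), by = "day")
res <- mclapply(days, process_day, mc.cores = 2)
unlist(res)
```

Note that `mclapply` relies on fork(), so on Windows it only runs with `mc.cores = 1`; on the Linux cluster described here that limitation does not apply.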

Job scheduling: Jenkins

  • Central system to run tasks.
  • Handles multiple nodes over SSH.
  • Automatic task allocation to nodes.
  • Provides HTTP API.
  • Handles error notifications.
  • Good replacement for cron.
  • Easy management.

Demo 5: Generating with N CPUs

from <- as.Date("2014-05-22")
to <- as.Date("2014-05-28")
mbp <- iddb(from = from,
            to = to,
            symbol = "3IINFOTECH",
            segment = "CASH",
            measures = "mbp",
            exchange = "NSE",
            fields = c("bbp", "bsp", "ltp"),
            depth = 5,
            generate = TRUE,
            yield = FALSE,
            ncpu = 3)

Programming

Choice of language

  • Very high performance: C
  • Big data processing: C
  • Statistics: R
  • Data sets less than 5GB: R (data.table)
  • Scripting: UNIX shell, R, and Python

Code writing

  • Code optimisations:
      – Use of integers wherever possible.
      – Vectorised code.
      – Use of 'matrix' rather than 'data.frame'.
      – Knowing copy-on-write semantics of the language.
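The optimisations above can be seen on a toy computation, a sum of squares over a large vector. This sketch only illustrates the points listed; it is not code from the system:

```r
n <- 100000L                       # integer literal, not numeric
x <- as.numeric(seq_len(n))

# Vectorised: a single C-level loop inside sum().
s_vec <- sum(x * x)

# The same quantity via an interpreted R loop, typically far slower.
s_loop <- 0
for (i in seq_len(n)) s_loop <- s_loop + x[i] * x[i]
stopifnot(all.equal(s_vec, s_loop))

# Copy-on-write: assigning y <- x is cheap (shared storage), but the
# first modification of y triggers a full copy, leaving x untouched.
# Repeated "in-place" updates of shared objects can thus cost O(n) each.
y <- x
y[1] <- 0
stopifnot(x[1] == 1)
```

A matrix stores one homogeneous block of memory, whereas a data.frame is a list of columns with per-column overhead, which is why the slides prefer 'matrix' for numeric work.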

Hardware

System design

Configuration

  • Mother server: 24 cores, 128GB RAM, 33TB
  • Workstations: 4 cores, 32GB RAM, 120GB SSD
  • File servers: 4 cores, 32GB RAM, 120GB SSD
  • Ethernet switch: 1Gbps
  • OS: Arch Linux

Storage Area Network

  • Storage: 36TB
  • Connectivity: 8Gbps Fiber channel
  • 2 load balancing/fail-over nodes
  • Tolerates power failure of a single node.

Hardware

Total capacity

  • 48 CPU cores
      – 24 x Intel Core i7-3770 @3.4GHz
      – 24 x AMD Opteron 6172 @2.1GHz
  • Total memory: 320GB
  • Storage: 99TB

Performance

  • ~300 GFLOPS
  • Linpack benchmark run with a matrix of order \(68000\).

Future design

Total capacity

  • 88 CPU cores
      – 64 x Intel Core i7-3770 @3.4GHz
      – 24 x AMD Opteron 6172 @2.1GHz
  • Total memory: 640GB
  • Storage: 136TB

Thank you