September 1, 2017

Data

Overview

  • 10 unique data sources.
  • Data from 6 exchanges, 1 depository, and a microfinance institution.
  • 2 proprietary databases: CMIE ProwessDX, Reuters terminal.
  • Public data from websites.
  • Data use contracts with 5 institutions.
  • Estimated size: 50+ TB
  • Organised on the filesystem in around 6.5 million files.

Data sources

Data from NSE

Features

  • Time period: 17 years (1999 - 2015)
  • Frequency of data flow: daily (EOD)
  • Frequency of transactions: tick-by-tick
  • Markets: spot, futures & options on spot, and currency derivatives
  • Trades and Orders (TAO)
  • Market By Price (MBP)
  • Limit Order Book (LOB) snapshots
  • Cash index data

Data types

  • Trades and Orders (TAO)
      – Contains all transactions that occurred at NSE.
      – Frequency of transactions: tick-by-tick
      – Transactions/second: ~60,000
      – Transactions/day: ~700 million
  • Market By Price (MBP)
      – Contains order book snapshots for all stocks.
      – Frequency of transactions: each second
      – Transactions/day: ~45 million

Data types

  • Limit Order Book (LOB)
      – Contains order book snapshots for all stocks.
      – Frequency of transactions: hourly
      – Records/day: ~10 million

Data from MCX

Features

  • Time period: 5+ years
  • Frequency of data flow: daily (EOD)
  • Frequency of transactions: tick-by-tick
  • Market: commodity futures
  • Trades and Orders (TAO)
  • Market By Price (MBP)
  • Bhavcopy

Data types

  • Trades and Orders
      – Contains all transactions that occurred at MCX.
      – Frequency: tick-by-tick
      – Transactions/day: ~11 million
      – Span: 2013 – 2017
  • MBP
      – Limit order book of commodity futures contracts, 5 levels deep.
      – Frequency: sub-second
      – Span: 2012 – 2016

Bhavcopy

  • End of day position of all contracts.
  • Frequency: daily
  • Span: 2003 – 2017

Data from NCDEX

Features

  • Time period: 3+ years
  • Frequency of data flow: multiple
  • Frequency of transactions: tick-by-tick
  • Market: commodity futures
  • Trades and Orders data
  • MBP data
  • Bhavcopy

Data types

  • CASH
      – Spot prices of commodities.
      – Data includes time of polling and location.
      – Span: 2012 – 2015
      – Frequency: twice/day
  • MBP
      – Commodity futures contracts.
      – Contains limit order book, 5 levels deep.
      – Span: 2014 – 2016

Data types

  • Bid Ask data
      – Contains the top bid and ask values of commodity futures contracts.
      – Contains OHLC values.
      – Not validated.
      – Span: 2012 – 2015
      – Frequency: every 3-4 seconds

Data types

  • Bid Ask live feed data
      – Contains bid and ask values of commodity futures contracts.
      – Span: 2016 – 2017
      – Frequency: per minute
  • Bhavcopy
      – End of day position of all contracts.
      – Frequency: daily
      – Span: 2006 – 2017

Data from other exchanges

Sources

  • Thomson Reuters Eikon
  • DGCX (trades and quotes): INR/USD futures
  • SGX (trades and quotes): Nifty futures
  • BSE public data
  • NSE public data

Data from NSDL

Features

  • Span: 2007 – 2017
  • Frequency: daily
  • Off-site data

Holdings and transactions data

  • Contains information for each unique client, including location and category.
  • Holdings data contains the holding quantity of each client for a unique ISIN.
  • Transactions data contains unique source and client IDs for every ISIN.
  • No direct access to the database.

Firm level database (FLD)

Features

  • Full dumps of CMIE Prowess and CMIE Capex databases.
  • Contains more than 40,000 firms.
  • Has 107 tables.
  • Data frequency: daily, monthly, quarterly, annual
  • Database: MySQL
  • Size: 12GB
  • Uses SQL to access data.

Macro economy data (tsdb)

Features

  • Macro-economic time series data
  • 20,000+ series
  • 23 sources
  • 200+ countries
  • Data span: 15-20 years

Software

Database

  • Flat file system.
  • No binary database system.
  • Can scale to any number of machines for computation.
  • Use of open source software.
  • Efficient and low cost.
  • A metadata database to manage data.
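The metadata layer over the flat filesystem can be pictured as a lookup table from (exchange, segment, date) to a file path. A minimal sketch, in which the table, the paths, and the `lookup` helper are all hypothetical illustrations, not the actual schema:

```r
# Hypothetical sketch of a metadata table for a flat-file data store:
# each row maps an (exchange, segment, date) key to a file on disk.
meta <- data.frame(
  exchange = c("NSE", "NSE", "MCX"),
  segment  = c("CASH", "FAO", "FAO"),
  date     = as.Date(rep("2014-05-23", 3)),
  path     = c("nse/cash/20140523.gz",   # illustrative paths only
               "nse/fao/20140523.gz",
               "mcx/fao/20140523.gz"),
  stringsAsFactors = FALSE
)

# Resolve a key to the file(s) holding that day's data.
lookup <- function(meta, exchange, segment, date) {
  meta$path[meta$exchange == exchange &
            meta$segment == segment &
            meta$date == date]
}
```

Because the files themselves are plain flat files, any machine holding a copy of this table can locate and process data independently, which is what lets the setup scale across machines.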

Application Programming Interface (API)

  • IntraDay DataBase (iddb)
      – Unified interface to fetch any data type or source.
      – Can run programs to derive data.
      – Designed to scale in terms of adding features.
      – Can handle parallelism.
      – Packaged in R package format to maintain discipline.

eventstudies

  • An R package to conduct standard event study analysis using not only daily return data but also intra-day returns data.
  • A variable event window can be defined, and the estimation period is set accordingly.
  • Models for estimation of abnormal returns:
      – Market model
      – Excess returns model
      – Constant mean returns model
      – Augmented market model
  • Inference strategies provided:
      – Bootstrap
      – Wilcoxon signed-rank
      – Classic (t-test)
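Two of the abnormal-return models listed above can be sketched in a few lines of R. The function names here are illustrative helpers, not the eventstudies package API:

```r
# Constant mean returns model: the abnormal return is the event-window
# return minus the mean return over the estimation window.
ar_constant_mean <- function(r_est, r_evt) {
  r_evt - mean(r_est)
}

# Market model: regress firm returns on market returns over the
# estimation window, then subtract the fitted return at the event.
ar_market_model <- function(r_est, rm_est, r_evt, rm_evt) {
  fit <- lm(r_est ~ rm_est)
  r_evt - (coef(fit)[1] + coef(fit)[2] * rm_evt)
}
```

With noise-free returns that are exactly linear in the market return, the market-model abnormal return is zero, which is a quick sanity check on the regression step.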

ifrogs

A package with miscellaneous functions relating to the computation of various financial measures. It currently has three modules:

  1. DtD: Computation of distance to default.
  2. pdshare: Information share.
  3. vix_ci: Computation of VIX.
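As an illustration of the DtD module, here is a minimal closed-form Merton distance to default. It assumes asset value and volatility are already known; in practice they must be backed out from equity data, which is the harder part the package handles. `dtd_merton` is a hypothetical name, not the ifrogs API:

```r
# Simplified Merton distance to default:
#   DtD = (log(V / D) + (mu - 0.5 * sigma^2) * h) / (sigma * sqrt(h))
# where V is asset value, D is the face value of debt, mu the asset
# drift, sigma the asset volatility, and h the horizon in years.
dtd_merton <- function(V, D, mu, sigma, horizon = 1) {
  (log(V / D) + (mu - 0.5 * sigma^2) * horizon) / (sigma * sqrt(horizon))
}

# A firm with assets 1.5x its debt, 5% drift, 25% asset volatility:
dtd_merton(V = 150, D = 100, mu = 0.05, sigma = 0.25)
```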

Data processes

Daily process

  • Daily download
  • Validation of fields
  • Backup on another server
  • Split files into stocks
  • Backup of the split files

Derived measures

  • Impact cost
  • Order matching engine
  • Algorithmic trading intensity
  • High frequency trading characteristics
  • Tracking of orders
  • Basis computation
  • Active and passive orders identification
  • Implied volatility
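As a sketch of the first measure: impact cost for a buy order of size q walking up the ask side of a 5-deep book, measured against the bid-ask midpoint. This is a simplified one-sided illustration with hypothetical names; the exchange definition averages the buy and sell sides over a standard order size:

```r
# Impact cost (%) of buying q units against an ask-side book.
# asks: data.frame with columns price and qty, best ask first.
impact_cost <- function(asks, best_bid, q) {
  mid <- (best_bid + asks$price[1]) / 2          # ideal (midpoint) price
  filled_before <- c(0, cumsum(asks$qty))[seq_len(nrow(asks))]
  take <- pmin(asks$qty, pmax(q - filled_before, 0))
  stopifnot(sum(take) == q)                      # book must be deep enough
  avg <- sum(take * asks$price) / q              # achieved average price
  100 * (avg - mid) / mid
}

asks <- data.frame(price = c(100.5, 101, 101.5), qty = c(100, 100, 100))
impact_cost(asks, best_bid = 100, q = 150)
```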

Order matching

  • Order entry, modification, and cancellation are \(O(1)\) time complexity.
  • Locating a price level is also \(O(1)\).
  • Producing depth is \(O(n^{2})\).
  • The limit order book is a pre-allocated array of pointers indexed by price.
  • At each price, orders are sorted by time.
  • Two operations remain linear, \(O(n)\):
      – Searching for a specific order within a price level.
      – Finding the next best price.
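The data structure described above can be sketched compactly: a pre-allocated array indexed by integer price ticks, holding arrival-ordered orders at each level. This is a toy R sketch with invented names (the production engine is written in C):

```r
# Pre-allocate one slot per price tick; empty slots are NULL.
new_book <- function(max_tick) vector("list", max_tick)

# The price tick is used directly as the array index, so reaching the
# right price level is O(1); orders at a level keep arrival (time) order.
add_order <- function(book, price_tick, order_id) {
  book[[price_tick]] <- c(book[[price_tick]], order_id)  # FIFO append
  book
}

# Finding the next best price is one of the O(n) linear scans noted
# above: walk the tick array until a non-empty level is found.
best_ask <- function(book) {
  for (p in seq_along(book)) if (length(book[[p]])) return(p)
  NA_integer_
}
```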

Order matching

The process

  1. Download updated code repository.
  2. Split compressed data using grep.
  3. Prepare data to use with matching engine.
  4. Run pre-open program.
  5. Run matching engine.
  6. Output trades, depth, and log files.
  7. Calculate error percentages.

Performance

  • Stock: SBIN/INFY
  • Depth: 10
  • Segment: Cash
  Year   Number of orders   Time (core)   Time (overall)
  2009             65,083         18.74               74
  2010          1,271,870         48.96              140
  2011          3,013,716        105.00              276
  2012          3,248,293        129.45              301
  2013          5,188,148        178.40              424
  2014          4,113,407        142.75              386

Performance graph (Cash)

Performance

  • Stock: SBIN
  • Depth: 10
  • Segment: Futures & options
  Year   Number of orders   Time (core)   Time (overall)
  2009            722,517         70.29              124
  2010          2,506,027        122.29              426
  2011          4,945,747        311.29              519
  2012         10,401,516        517.26              780
  2013         14,288,199        716.56             1020
  2014         16,591,150       1115.02             1620

Performance graph (FAO)

Performance: AT Intensity

Performance: Impact Cost

Demo 1: Computing AT intensity

d <- as.Date("2014-05-23")
system.time(
    atintensity <- iddb(dates = d,
                        symbol = "INFY",
                        segment = "CASH",
                        measures = "atintensity",
                        type = "value",
                        jiffylevel = FALSE)
)

Demo 2: Volatility computation using MBP data

system.time(
    mbp <- iddb(dates = d,
                symbol = "INFY",
                segment = "CASH",
                measures = "mbp",
                fields = "ltp")
)
vol <- function(x) {
    x <- xts::to.minutes(x, OHLC = FALSE)
    vol <- sd(x / 100)
    return(vol)
}

Demo 3: MBP data from NCDEX

system.time(
    comm <- iddb(dates = d,
                 symbol = "CASTORSEED",
                 segment = "FAO",
                 measures = "mbp",
                 exchange = "NCDEX",
                 fields = "ltp")
)
vol(comm[[1]])

Demo 4: Generating data on the fly

d <- as.Date("2014-05-27")
mbp <- iddb(dates = d,
            symbol = "3IINFOTECH",
            segment = "CASH",
            measures = "mbp",
            fields = c("bbp", "bsp"),
            depth = 5,
            generate = TRUE,
            wait = TRUE,
            showConsole = TRUE)

Parallel computing

Process

  • Distribute workload across nodes.
  • Identify tasks to run on different machines.
  • Split the tasks and execute.
  • Fetch or store the results.

Strategy

  • Processing on daily files.
  • No dependence on previous day's results.
  • Embarrassingly parallel tasks.
  • Scalable to N CPUs.
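The strategy above maps directly onto R's parallel package: one independent task per daily file, fanned out over N CPUs with no shared state. A minimal sketch, where `process_day` is a hypothetical stand-in for the real per-day job:

```r
library(parallel)

# Each day's file is processed independently, so mclapply can fork the
# work across cores (mc.cores = N). Placeholder work: the length of the
# formatted date string.
process_day <- function(day) {
  nchar(format(day))
}

days <- seq(as.Date("2014-05-22"), as.Date("2014-05-28"), by = "day")
res <- mclapply(days, process_day, mc.cores = 2)
unlist(res)
```

Note that `mclapply` relies on fork(), so on Windows it only runs with `mc.cores = 1`; on the Linux cluster described here that limitation does not apply.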

Job scheduling: Jenkins

  • Central system to run tasks.
  • Handles multiple nodes over SSH.
  • Automatic task allocation to nodes.
  • Provides HTTP API.
  • Handles error notifications.
  • Good replacement for cron.
  • Easy management.

Demo 5: Generating with N CPUs

from <- as.Date("2014-05-22")
to <- as.Date("2014-05-28")
mbp <- iddb(from = from,
            to = to,
            symbol = "3IINFOTECH",
            segment = "CASH",
            measures = "mbp",
            exchange = "NSE",
            fields = c("bbp", "bsp", "ltp"),
            depth = 5,
            generate = TRUE,
            yield = FALSE,
            ncpu = 3)

Programming

Choice of language

  • Very high performance: C
  • Big data processing: C
  • Statistics: R
  • Data sets less than 5GB: R (data.table)
  • Scripting: UNIX shell, R, and Python

Code writing

  • Code optimisations:
      – Use of integers wherever possible.
      – Vectorised code.
      – Use of 'matrix' rather than 'data.frame'.
      – Knowing copy-on-write semantics of the language.
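The optimisations above can be seen on a toy computation, a sum of squares over a large vector. This sketch only illustrates the points listed; it is not code from the system:

```r
n <- 100000L                       # integer literal, not numeric
x <- as.numeric(seq_len(n))

# Vectorised: a single C-level loop inside sum().
s_vec <- sum(x * x)

# The same quantity via an interpreted R loop, typically far slower.
s_loop <- 0
for (i in seq_len(n)) s_loop <- s_loop + x[i] * x[i]
stopifnot(all.equal(s_vec, s_loop))

# Copy-on-write: assigning y <- x is cheap (shared storage), but the
# first modification of y triggers a full copy, leaving x untouched.
# Repeated "in-place" updates of shared objects can thus cost O(n) each.
y <- x
y[1] <- 0
stopifnot(x[1] == 1)
```

A matrix stores one homogeneous block of memory, whereas a data.frame is a list of columns with per-column overhead, which is why the slides prefer 'matrix' for numeric work.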

Hardware

System design

Configuration

  • Mother server: 24 cores, 128GB RAM, 33TB
  • Workstations: 4 cores, 32GB RAM, 120GB SSD
  • File servers: 4 cores, 32GB RAM, 120GB SSD
  • Ethernet switch: 1Gbps
  • OS: Arch Linux

Storage Area Network

  • Storage: 36TB
  • Connectivity: 8Gbps Fiber channel
  • 2 load balancing/fail-over nodes
  • Tolerates power failure of a single node.

Hardware

Total capacity

  • 48 CPU cores
      – 24 x Intel Core i7-3770 @3.4GHz
      – 24 x AMD Opteron 6172 @2.1GHz
  • Total memory: 320GB
  • Storage: 99TB

Performance

  • ~300 GFLOPS
  • Linpack benchmark run with a matrix of order \(68000\).

Future design

Total capacity

  • 88 CPU cores
      – 64 x Intel Core i7-3770 @3.4GHz
      – 24 x AMD Opteron 6172 @2.1GHz
  • Total memory: 640GB
  • Storage: 136TB

Thank you