Datamonster Python Library¶
Release v0.5.2.
This library eases the use of the DataMonster REST API via Python.
Quickstart¶
Installing the Datamonster library:
pip install datamonster_api
Working with companies:
from datamonster_api import DataMonster
dm = DataMonster(<key_id>, <secret_key>)
# Prints all companies whose name or ticker matches 'hd'
print(list(dm.get_companies(query='hd')))
# Creates a company object for Apple
apple = dm.get_company_by_ticker('aapl')
# Prints the first 5 quarter-end dates
print(apple.quarters[:5])
# Prints the first 5 data sources that cover Apple
print(list(apple.datasources)[:5])
Working with data sources:
# Prints all data sources whose name or provider matches 'fake'
print(list(dm.get_datasources(query='fake')))
# Prints all data sources whose name or provider matches 'fake'
# AND that also cover Apple
print(list(dm.get_datasources(query='fake', company=apple)))
# Prints first 5 companies covered by `Fake Data Source`
datasource = list(
    dm.get_datasources(query='Fake Data Source')
)[0]
print(list(datasource.companies)[:5])
Getting data:
import datetime
from datamonster_api import Aggregation
# Gets a datasource object
apple = dm.get_company_by_ticker('aapl')
datasource = next(apple.datasources)
# Gets all data for the data source filtering on apple
datasource.get_data(apple)
agg = Aggregation(period='fiscalQuarter', company=apple)
# Gets all data for the given data source filtered by apple,
# aggregated by apple's fiscal quarter, and starting on
# January 1, 2017 (inclusive)
datasource.get_data(
    apple,
    agg,
    start_date=datetime.date(2017, 1, 1)
)
This part of the documentation covers all the interfaces of the datamonster_api library.
Interface¶
class datamonster_api.DataMonster(key_id, secret, server=None, verify=True)
DataMonster object. Main entry point to the library.
Parameters:
- key_id – (str) a user’s public key
- secret – (str) a user’s secret key
- server – (optional, str) defaults to dm.adaptivemgmt.com
- verify – (optional, bool) whether to verify the server’s TLS certificate
get_companies(query=None, datasource=None)
Get available companies.
Parameters:
- query – Optional query that will restrict companies by ticker or name
- datasource – Optional Datasource object that restricts companies to those covered by the given data source
Returns: Iterator of Company objects
get_company_by_id(company_id)
Get a single company by id.
Parameters: company_id – (str or int) unique internal identifier for the desired company. Can take str form, e.g. ‘718’, or int form, e.g. 707. To find the id of a frequently used company, find the company by ticker and call .pk on the resulting Company object.
Returns: Single Company object if any company matches the id
Raises: DataMonsterError if no company matches the id
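For example (a sketch assuming dm is a DataMonster object; the id comes from .pk on a previously fetched company):
apple = dm.get_company_by_ticker('aapl')
same_company = dm.get_company_by_id(apple.pk)
print(same_company.name)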
get_company_by_ticker(ticker)
Get a single company by ticker.
Parameters: ticker – Ticker to search for
Returns: Single Company object if any company exactly matches the ticker (case insensitive)
Raises: DataMonsterError if no companies match the ticker
get_company_details(company_id)
Get details for the given company.
Parameters: company_id – (str or int) unique internal identifier for the company. See the method get_company_by_id for more info on company_id.
Returns: (dict) details (metadata) for this company, providing basic information.
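For example (a sketch reusing the Apple company from the Quickstart):
apple = dm.get_company_by_ticker('aapl')
details = dm.get_company_details(apple.pk)
print(details)  # dict of basic metadata for the company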
get_data(datasource, company, aggregation=None, start_date=None, end_date=None)
Get data for a data source.
Parameters:
- datasource – Datasource object to get the data for
- company – Company object to filter the data source on
- aggregation – Optional Aggregation object to specify the aggregation of the data
- start_date – Optional filter for the start date of the data
- end_date – Optional filter for the end date of the data
Returns: pandas.DataFrame
See the Examples section below for usage.
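For example (a sketch assuming dm is a DataMonster object, reusing the fake data source and Apple company from the Quickstart):
import datetime
from datamonster_api import Aggregation

apple = dm.get_company_by_ticker('aapl')
datasource = next(dm.get_datasources(query='Fake Data Source'))

# Aggregate by Apple's fiscal quarters
agg = Aggregation(period='fiscalQuarter', company=apple)

# DataFrame of data for this source, filtered to Apple and limited to 2018
df = dm.get_data(
    datasource,
    apple,
    aggregation=agg,
    start_date=datetime.date(2018, 1, 1),
    end_date=datetime.date(2018, 12, 31),
)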
get_data_group_by_id(id)
Given a data group pk (primary key), return the corresponding DataGroup object. To find the pk for a data group, first find it using the iterator returned by get_data_groups(), then call .id on the DataGroup object.
Parameters: id – (int)
Returns: Single DataGroup object with the given id
Raises: DataMonsterError if no data group matches the given id
get_data_group_details(id)
Given a data group id, return the corresponding DataGroup object.
Parameters: id – (int)
Returns: Single DataGroup object with the given id
Raises: DataMonsterError if no data group matches the given id
get_data_groups(query=None)
Get available data groups.
Parameters: query – (str) Optional query that will restrict data groups by name or data source name
Returns: Iterator of DataGroup objects
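For example (a sketch; the query string is a placeholder):
# List data groups whose name or data source name matches 'fake'
groups = list(dm.get_data_groups(query='fake'))
for group in groups:
    print(group)

# Keep the id of the first match so it can be fetched directly later
if groups:
    first_id = groups[0].id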
get_data_raw(datasource, filters=None, aggregation=None)
Get raw data for all companies available in the data source.
Parameters:
- datasource – Datasource object to get the data for
- filters – (dict) dictionary of requested filters
- aggregation – Optional Aggregation object to specify the requested aggregation
Returns: (schema, pandas.DataFrame)
See the Get Data Raw example below for usage.
get_datasource_by_id(datasource_id)
Given a data source UUID (universally unique identifier), return the corresponding Datasource object. To find the UUID for a data source, first find it by name, then call .id on the resulting Datasource object.
Parameters: datasource_id – (str)
Returns: Single Datasource object with the given id
Raises: DataMonsterError if no data source matches the given id
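For example (a sketch that finds a data source by name, then re-fetches it by its UUID; the name is a placeholder):
ds = dm.get_datasource_by_name('Fake Data Source')
same_ds = dm.get_datasource_by_id(ds.id)
print(same_ds.name)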
get_datasource_by_name(name)
Given a name, try to find a data source of that name.
Parameters: name – (str)
Returns: Single Datasource object with the given name
Raises: DataMonsterError if no data source matches the given name
get_datasource_details(datasource_id)
Get details (metadata) for the data source corresponding to the given UUID.
Parameters: datasource_id – (str) See the method get_datasource_by_id for more info on datasource_id.
Returns: (dict) details (metadata) for this data source, providing basic information.
get_datasources(query=None, company=None)
Get available data sources.
Parameters:
- query – (str) Optional query that will restrict data sources by name or provider name
- company – Optional Company object that restricts data sources to those that cover the given company
Returns: Iterator of Datasource objects
get_dimensions_for_datasource(datasource, filters=None, add_company_info_from_pks=False)
Get dimensions (“splits”) for the data source from the DataMonster REST endpoint /datasource/<uuid>/dimensions?filters=..., where the filters string is optional.
Parameters:
- datasource – Datasource object
- filters – (dict) a dict of key/value pairs to filter dimensions by
- add_company_info_from_pks – (bool) Determines whether the return value will include tickers for the returned companies. If False, only section_pks will be returned.
Returns: a DimensionSet object - an iterable through a collection of dimension dicts, filtered as requested. See the DimensionSet documentation below for more info on DimensionSet objects.
Raises: DataMonsterError if filters is not a dict or is not JSON-serializable. Re-raises DataMonsterError if self.client.get() raises that.
See the Get Dimensions for Datasource example below for usage.
Objects¶
class datamonster_api.Company(_id, ticker, name, uri, dm)
Representation of a company in DataMonster.
Parameters:
- _id – (str) unique internal identifier for the company
- ticker – (str) ticker of the company
- name – (str) name of the company
- uri – (str) DataMonster resource identifier associated with the company
- dm – DataMonster object
property ticker
Returns: (str) company ticker, None if company is private
property name
Returns: (str) company name, including the associated vendor
property quarters
Returns: (list) list of company quarter dates, including 4 projected dates. Empty if company is private.
datasources
Returns: (iter) iterable of Datasource objects associated with this company to which the user has access, memoized
get_details()
Get details (metadata) for this company, providing basic information as stored in DataMonster.
Returns: (dict)
pk
Returns: (int) the unique internal identifier for the company (corresponds to _id)
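A short sketch of working with a Company object (assuming dm is a DataMonster object):
apple = dm.get_company_by_ticker('aapl')

print(apple.pk)             # unique internal identifier (int)
print(apple.ticker)         # ticker, None if the company is private
print(apple.name)           # company name
print(apple.quarters[:4])   # quarter dates, empty if the company is private
print(apple.get_details())  # dict of basic metadata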
class datamonster_api.Datasource(_id, name, category, uri, dm)
Representation of a data source in DataMonster.
Parameters:
- _id – (str) unique internal identifier for the data source
- name – (str) name of the data source, including the vendor for the data
- category – (list) associated categories
- uri – (str) DataMonster resource identifier associated with the data source
- dm – DataMonster object
property name
Returns: (str) name of the data source, including the vendor
property category
Returns: (str) category associated with the data source, e.g., Web Scrape Data or Uploaded Data
companies
Returns: (iter) iterable of Company objects associated with this data source, memoized
get_data(company, aggregation=None, start_date=None, end_date=None)
Get data for this data source.
Parameters:
- company – Company object to filter the data source on
- aggregation – Optional Aggregation object to specify the aggregation of the data
- start_date – Optional filter for the start date of the data; accepted formats include a string in YYYY-MM-DD or MM/DD/YYYY form, or a pandas or regular datetime object
- end_date – Optional filter for the end date of the data; accepted formats include a string in YYYY-MM-DD or MM/DD/YYYY form, or a pandas or regular datetime object
Returns: pandas.DataFrame
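For example (a sketch assuming datasource and apple from the Quickstart; note that string dates are accepted):
from datamonster_api import Aggregation

agg = Aggregation(period='fiscalQuarter', company=apple)

# Data for Apple, aggregated by Apple's fiscal quarter,
# restricted to calendar year 2018 using YYYY-MM-DD strings
df = datasource.get_data(
    apple,
    agg,
    start_date='2018-01-01',
    end_date='2018-12-31',
)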
get_details()
Get details (metadata) for this data source, providing basic information as stored in DataMonster.
Returns: (dict)
get_dimensions(company=None, add_company_info_from_pks=True, **kwargs)
Return the dimensions for this data source, restricted to the given company or companies and filtered by any kwargs items. Not memoized.
Parameters:
- company – a Company object, a list or tuple of Company objects, or None. If not None, the return value will only include rows corresponding to the given companies.
- add_company_info_from_pks – Determines whether the return value will include tickers for the returned companies. If False, only section_pks will be returned.
- kwargs – Additional items to filter by, e.g. category='Banana Republic'
Returns: a DimensionSet object - an iterable through a collection of dimension dicts, filtered as requested. See the DimensionSet documentation below for more info, and the Get Dimensions for Datasource example below for usage of a similar function.
Raises: can raise DataMonsterError if company is not of an expected type, or if some kwarg item is not JSON-serializable.
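For example (a sketch assuming dm is a DataMonster object and datasource is a Datasource object; the ticker and category come from the Get Dimensions for Datasource example below):
the_gap = dm.get_company_by_ticker('GPS')

# Dimensions for this data source, restricted to The Gap and to the
# 'Banana Republic' category (passed as a kwarg filter)
dimset = datasource.get_dimensions(
    company=the_gap,
    category='Banana Republic',
)
for dim in dimset:
    print(dim)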
class datamonster_api.DimensionSet(url, dm, add_company_info_from_pks)
An iterable through a collection of dimension dictionaries.
Each dimension dictionary has 4 keys: max_date, min_date, row_count, and split_combination. The first two have values that are dates as strings in ISO format; split_combination points to a dict containing data from all other columns; row_count points to an int specifying how many rows match the dates and all splits in split_combination.
has_extra_company_info
Returns: (bool) The value passed as add_company_info_from_pks to the constructor, coerced to bool.
max_date
Returns: (str) max of the max_date of the dimension dicts
min_date
Returns: (str) min of the min_date of the dimension dicts
row_count
Returns: (int) number of rows matching the filters for this DimensionSet
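A sketch of reading a DimensionSet (for instance, one returned by get_dimensions_for_datasource or Datasource.get_dimensions, here called dimset):
print(dimset.row_count)                  # total rows matching the filters
print(dimset.min_date, dimset.max_date)  # overall date range, ISO strings
print(dimset.has_extra_company_info)     # whether tickers were added from section_pks

for dim in dimset:
    print(dim['split_combination'], dim['row_count'])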
Data Upload¶
class datamonster_api.DataGroup(_id, name, columns, status, dm)
Representation of a DataGroup in DataMonster.
Parameters:
- _id – (int) unique internal identifier for the Data Group
- name – (str) name of the Data Group
- columns – (list of DataGroupColumn objects) representing the columns of uploaded data
- status – (str, enum) Status of the Data Sources in the DataGroup at instantiation time. This property is updated by get_current_status. It can take one of the following three values: success if all Data Sources in the group have successfully loaded; processing if any Data Source in the group is still processing; error if any Data Source in the group is in an error state. Note: error takes precedence over processing.
- dm – DataMonster object
get_current_status()
Query DataMonster servers for the most up-to-date status of this DataGroup. Calling this method will update the status field on this instance and return it.
Returns: The status of this DataGroup. Values can be one of the following: success if all Data Sources in the group have successfully loaded; processing if any Data Source in the group is still processing; error if any Data Source in the group is in an error state. Note: error takes precedence over processing.
Examples¶
Get Data Raw¶
Initialize a DataMonster object:
dm = DataMonster(<key_id>, <secret_key>)
Initialize a Datasource object (we will use a fake small data source from the provider XYZ for the purposes of this example):
ds = dm.get_datasource_by_name('XYZ Data Source')
Get raw data from the data source, producing a schema and pandas dataframe:
schema, df = dm.get_data_raw(ds)
The schema will contain metadata for the data source, with keys and values showing the roles different columns play in the data. In the case of the above data source:
>>> schema
{
    'lower_date': ['period_start'],
    'upper_date': ['period_end'],
    'section_pk': ['section_pk'],
    'value': ['panel_sales'],
    'split': ['category']
}
This result indicates that the period_start column represents the lower date for each data point, and so on.
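Since the schema maps column roles to actual column names, it can be used to pick out columns programmatically. A small sketch under the schema above:
# Look up the column names for this data source by role
value_col = schema['value'][0]       # 'panel_sales'
lower_col = schema['lower_date'][0]  # 'period_start'

# Average value per period start date
print(df.groupby(lower_col)[value_col].mean())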
Next, looking at the dataframe we see:
>>> df.head(4)

| category      | panel_sales | period_end | period_start | section_pk |
|---------------|-------------|------------|--------------|------------|
| Not specified | -0.1139     | 2017-01-01 | 2016-10-02   | 617        |
| Not Specified | -0.0523     | 2018-07-02 | 2018-04-02   | 742        |
| Category1     | -0.2233     | 2018-07-02 | 2018-04-02   | 742        |
| Category1     | -0.4132     | 2019-03-31 | 2019-01-01   | 205        |
Note that the section_pk column, which represents which company each data point refers to, is currently in the form of an internal DataMonster identifier and is not particularly useful for external use. To convert to a more usable form, try:
comps = ds.companies
section_map = {}
for comp in comps:
    section_map[comp.pk] = {"name": comp.name,
                            "ticker": comp.ticker}
def map_pk_to_ticker_and_name(section_map, df):
    ticker_dict = {
        pk: v["ticker"] for pk, v in section_map.items()
    }
    name_dict = {
        pk: v["name"] for pk, v in section_map.items()
    }
    df["ticker"] = df["section_pk"].map(ticker_dict)
    df["comp_name"] = df["section_pk"].map(name_dict)
    df = df.drop(["section_pk"], axis=1)
    return df
We can now use map_pk_to_ticker_and_name to produce a more human-readable dataframe. For example:
>>> map_pk_to_ticker_and_name(section_map, df).head(4)

| category      | panel_sales | period_end | period_start | ticker | comp_name                |
|---------------|-------------|------------|--------------|--------|--------------------------|
| Not specified | -0.1139     | 2017-01-01 | 2016-10-02   | PRTY   | PARTY CITY               |
| Not Specified | -0.0523     | 2018-07-02 | 2018-04-02   | RUTH   | RUTH’S HOSPITALITY GROUP |
| Category1     | -0.2233     | 2018-07-02 | 2018-04-02   | RUTH   | RUTH’S HOSPITALITY GROUP |
| Category1     | -0.4132     | 2019-03-31 | 2019-01-01   | HD     | HOME DEPOT               |
Filtering to Specific Dimensions¶
The raw data endpoint supports filtering to specific values for dimensions by applying key/value pairs as a dictionary, where the key is the dimension name and the value is a list of possibilities for that dimension. Using the example above, we could do this in a variety of ways.
Filtering to specific companies (in this case, Party City and Home Depot):
>>> filters = {'section_pk': [617, 205]}
>>> schema, df = dm.get_data_raw(ds, filters=filters)
| category      | panel_sales | period_end | period_start | section_pk |
|---------------|-------------|------------|--------------|------------|
| Not specified | -0.1139     | 2017-01-01 | 2016-10-02   | 617        |
| Category1     | -0.4132     | 2019-03-31 | 2019-01-01   | 205        |
Filtering to specific dimension values (in this case, "Category1"):
>>> filters = {'category': ['Category1']}
>>> schema, df = dm.get_data_raw(ds, filters=filters)
| category  | panel_sales | period_end | period_start | section_pk |
|-----------|-------------|------------|--------------|------------|
| Category1 | -0.2233     | 2018-07-02 | 2018-04-02   | 742        |
| Category1 | -0.4132     | 2019-03-31 | 2019-01-01   | 205        |
Combining filters across dimensions (in this case, "Category1" for Ruth’s Hospitality Group):
>>> filters = {'section_pk': [742], 'category': ['Category1']}
>>> schema, df = dm.get_data_raw(ds, filters=filters)
| category  | panel_sales | period_end | period_start | section_pk |
|-----------|-------------|------------|--------------|------------|
| Category1 | -0.2233     | 2018-07-02 | 2018-04-02   | 742        |
Aggregating Results on Different Cadences¶
The raw data endpoint can also take an optional Aggregation object to request data with a time-based aggregation applied.
For example:
from datamonster_api import DataMonster, Aggregation
dm = DataMonster(<key_id>, <secret_key>)
# Get Company for Home Depot
hd = dm.get_company_by_ticker('hd')
# Get our Data Source
ds = dm.get_datasource_by_name('XYZ Data Source')
# Filter to Home Depot data and aggregate by Home Depot's fiscal quarters
filters = {'section_pk': [hd.pk]}
agg = Aggregation(period='fiscalQuarter', company=hd)
dm.get_data_raw(ds, filters=filters, aggregation=agg)
Get Dimensions for Datasource¶
Assuming dm is a DataMonster object, and given this fake data source and company:
datasource = next(
    dm.get_datasources(query="Fake Data Source")
)
the_gap = dm.get_company_by_ticker("GPS")
this call to get_dimensions_for_datasource:
dimset = dm.get_dimensions_for_datasource(
    datasource,
    filters={
        "section_pk": the_gap.pk,
        "category": "Banana Republic",
    },
)
returns an iterable, dimset, over a collection containing just one dimension dict.
Assuming from pprint import pprint, the following loop:
for dim in dimset:
    pprint(dim)
pretty-prints the single dimension dict:
{
    "max_date": "2019-06-21",
    "min_date": "2014-01-01",
    "row_count": 1998,
    "split_combination": {
        "category": "Banana Republic",
        "country": "US",
        "section_pk": 707,
    },
}
How to Data Upload¶
The API is meant to programmatically refresh existing data uploads. The initial upload that specifies the schema must still be performed via the UI. Currently the API supports searching for data groups one owns, checking their processing status, and uploading valid DataFrames to existing data groups. To list the available data groups:
for data_group in dm.get_data_groups():
    print(data_group)
Alternatively, one can fetch a data group by its ID and view its status:
dg = dm.get_data_group_by_id(1012)
dg.get_current_status()
To view the columns (and schema) of the data group in order to verify the type of data we wish to re-upload:
dg.columns
To refresh the data, call start_data_refresh with a valid pandas.DataFrame object that matches the schema of the data group.
df = pandas.DataFrame({
    'Start_Date': ['2019-01-01'],
    'end date': ['2019-01-02'],
    'dummy data 1': [1],
    'dummy data_2': [1],
    'Ticker': ['AAP'],
    ...
})
dg.start_data_refresh(df)
dg.get_current_status()
The status of the data group object will change to reflect the latest status.
If the schema of the DataFrame does not match the schema expected by the data group, an exception is raised with a useful message.
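A defensive sketch of the refresh call (assuming DataMonsterError can be imported from datamonster_api):
from datamonster_api import DataMonsterError

try:
    dg.start_data_refresh(df)
except DataMonsterError as e:
    # The message explains how the DataFrame's schema differs from
    # the schema the data group expects
    print(e)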