Vita
vita::dataframe Class Reference

A 2-dimensional labeled data structure with columns of potentially different types.

#include <dataframe.h>

Classes

class  columns_info
 Information about the collection of columns (type, name, output index).
 
struct  example
 Stores a single element (row) of the dataset.
 
class  params
 

Public Types

using const_iterator = examples_t::const_iterator
 
using difference_type = examples_t::difference_type
 
using examples_t = std::vector< example >
 
using filter_hook_t = std::function< bool(record_t &)>
 A filter and transform function: it returns true for records that should be loaded and may modify the record it receives.
 
using iterator = examples_t::iterator
 
using record_t = std::vector< std::string >
 Raw input record.
 
using value_type = examples_t::value_type
 

Public Member Functions

iterator begin ()
 
const_iterator begin () const
 
std::string class_name (class_t) const
 
class_t classes () const
 
void clear ()
 Removes all elements from the container.
 
 dataframe ()
 New empty data instance.
 
 dataframe (const std::filesystem::path &)
 
 dataframe (const std::filesystem::path &, const params &)
 New dataframe instance containing the learning collection from a file.
 
 dataframe (std::istream &)
 
 dataframe (std::istream &, const params &)
 New dataframe instance containing the learning collection from a stream.
 
bool empty () const
 
iterator end ()
 
const_iterator end () const
 
iterator erase (iterator, iterator)
 Removes specified elements from the dataframe.
 
value_type & front ()
 Returns a reference to the first element in the dataframe.
 
value_type front () const
 Returns a constant reference to the first element in the dataframe.
 
bool is_valid () const
 
bool operator! () const
 
void push_back (const example &)
 Appends the given element to the end of the active dataset.
 
std::size_t read (const std::filesystem::path &)
 
std::size_t read (const std::filesystem::path &, const params &)
 Loads the content of a file into the active dataset.
 
std::size_t read_csv (std::istream &)
 
std::size_t read_csv (std::istream &, params)
 Loads a CSV file into the active dataset.
 
std::size_t read_xrff (std::istream &)
 
std::size_t read_xrff (std::istream &, const params &)
 Loads an XRFF file from a stream into the dataframe.
 
std::size_t size () const
 
unsigned variables () const
 

Public Attributes

columns_info columns
 

Detailed Description

A 2-dimensional labeled data structure with columns of potentially different types.

You can think of it like a spreadsheet or SQL table.

See also
https://github.com/morinim/vita/wiki/dataframe

Definition at line 47 of file dataframe.h.

Member Typedef Documentation

◆ const_iterator

using vita::dataframe::const_iterator = examples_t::const_iterator

Definition at line 121 of file dataframe.h.

◆ difference_type

using vita::dataframe::difference_type = examples_t::difference_type

Definition at line 122 of file dataframe.h.

◆ examples_t

using vita::dataframe::examples_t = std::vector<example>

Definition at line 55 of file dataframe.h.

◆ filter_hook_t

using vita::dataframe::filter_hook_t = std::function<bool (record_t &)>

A filter and transform function: it returns true for records that should be loaded and may modify the record it receives.

Definition at line 65 of file dataframe.h.
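
As a rough illustration of how such a hook can both filter and transform, here is a minimal, self-contained sketch. It does not include any Vita header: `record_t` and `filter_hook_t` are local aliases copied from the typedefs above, while `drop_missing_and_lowercase` and `load_records` are hypothetical names introduced only for this example. The use of `"?"` as a missing-value marker is likewise an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Local aliases that mirror the documented typedefs (no Vita header needed).
using record_t = std::vector<std::string>;
using filter_hook_t = std::function<bool(record_t &)>;

// Example hook (hypothetical): rejects records containing a "?" field and
// lower-cases every field of the records it accepts.
inline bool drop_missing_and_lowercase(record_t &r)
{
  if (std::find(r.begin(), r.end(), "?") != r.end())
    return false;  // filter: this record must not be loaded

  for (auto &field : r)
    std::transform(field.begin(), field.end(), field.begin(),
                   [](unsigned char c)
                   { return static_cast<char>(std::tolower(c)); });

  return true;  // keep the (transformed) record
}

// Hypothetical loader loop: keeps only the records accepted by the hook.
inline std::vector<record_t> load_records(std::vector<record_t> raw,
                                          const filter_hook_t &hook)
{
  std::vector<record_t> kept;
  for (auto &r : raw)
    if (!hook || hook(r))  // an empty hook accepts every record
      kept.push_back(std::move(r));
  return kept;
}
```

Calling `load_records(raw, drop_missing_and_lowercase)` would return only the records without a `"?"` field, with their fields lower-cased. How the hook is registered with the actual loader (presumably through `params`) is not described in this section.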

◆ iterator

using vita::dataframe::iterator = examples_t::iterator

Definition at line 120 of file dataframe.h.

◆ record_t

using vita::dataframe::record_t = std::vector<std::string>

Raw input record.

The ETL chain is:

FILE -> record_t -> example --(vita::push_back)--> vita::dataframe

Definition at line 61 of file dataframe.h.

◆ value_type

using vita::dataframe::value_type = examples_t::value_type

Definition at line 56 of file dataframe.h.

Constructor & Destructor Documentation

◆ dataframe() [1/5]

vita::dataframe::dataframe ( )

New empty data instance.

Definition at line 181 of file dataframe.cc.

◆ dataframe() [2/5]

vita::dataframe::dataframe ( std::istream &  is)
explicit

Definition at line 200 of file dataframe.cc.

◆ dataframe() [3/5]

vita::dataframe::dataframe ( std::istream &  is,
const params &  p 
)

New dataframe instance containing the learning collection from a stream.

Parameters
[in]  is  input stream
[in]  p   additional, optional parameters (see params structure)
Remarks
Data from the input stream must be in CSV format.

Definition at line 194 of file dataframe.cc.

◆ dataframe() [4/5]

vita::dataframe::dataframe ( const std::filesystem::path &  fn)
explicit

Definition at line 217 of file dataframe.cc.

◆ dataframe() [5/5]

vita::dataframe::dataframe ( const std::filesystem::path &  fn,
const params &  p 
)

New dataframe instance containing the learning collection from a file.

Parameters
[in]  fn  name of the file containing the learning collection (CSV / XRFF format)
[in]  p   additional, optional parameters (see params structure)

Definition at line 210 of file dataframe.cc.

Member Function Documentation

◆ begin() [1/2]

dataframe::iterator vita::dataframe::begin ( )
Returns
an iterator to the first element of the active dataset

Definition at line 235 of file dataframe.cc.

◆ begin() [2/2]

dataframe::const_iterator vita::dataframe::begin ( ) const
Returns
a constant iterator to the first element of the active dataset

Definition at line 243 of file dataframe.cc.

◆ class_name()

std::string vita::dataframe::class_name ( class_t  i) const
Parameters
[in]  i  the encoded (dataframe::encode()) value of a class
Returns
the name of the class encoded by i (or an empty string if no such class can be found)

Definition at line 423 of file dataframe.cc.
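
The relationship between an encoded class value and its name can be pictured with the following stand-alone sketch. It is not Vita's implementation: `label_encoder` is a hypothetical class, and `class_t` is assumed here to be `std::size_t` (the real type may differ). The sketch only mirrors the documented behaviour that an unknown code maps to an empty string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using class_t = std::size_t;  // assumption: stand-in for the library's class_t

// Hypothetical encoder: assigns consecutive integer codes to string labels.
class label_encoder
{
public:
  // Returns the code of `label`, registering the label on first use.
  class_t encode(const std::string &label)
  {
    const auto it = std::find(names_.begin(), names_.end(), label);
    if (it != names_.end())
      return static_cast<class_t>(it - names_.begin());

    names_.push_back(label);
    return names_.size() - 1;
  }

  // Returns the label encoded by `i`, or an empty string for unknown codes.
  std::string class_name(class_t i) const
  {
    return i < names_.size() ? names_[i] : std::string();
  }

  // Number of distinct labels seen so far.
  class_t classes() const { return names_.size(); }

private:
  std::vector<std::string> names_;
};
```

With this sketch, encoding the same label twice yields the same code, and `class_name` applied to a code that was never issued returns an empty string.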

◆ classes()

class_t vita::dataframe::classes ( ) const
Returns
number of classes of the problem (== 0 for a symbolic regression problem, > 1 for a classification problem)

Definition at line 308 of file dataframe.cc.

◆ clear()

void vita::dataframe::clear ( )

Removes all elements from the container.

Invalidates any references, pointers or iterators referring to contained examples. Any past-the-end iterators are also invalidated.

Leaves the associated metadata unchanged.

Definition at line 227 of file dataframe.cc.

◆ empty()

bool vita::dataframe::empty ( ) const
Returns
true if the dataframe is empty

Definition at line 299 of file dataframe.cc.

◆ end() [1/2]

dataframe::iterator vita::dataframe::end ( )
Returns
an iterator to the past-the-end (sentinel) element of the active dataset

Definition at line 251 of file dataframe.cc.

◆ end() [2/2]

dataframe::const_iterator vita::dataframe::end ( ) const
Returns
a constant iterator to the past-the-end (sentinel) element of the active dataset

Definition at line 259 of file dataframe.cc.

◆ erase()

dataframe::iterator vita::dataframe::erase ( iterator  first,
iterator  last 
)

Removes specified elements from the dataframe.

Parameters
[in]  first  first element of the range
[in]  last   end of the range
Returns
iterator following the last removed element

Definition at line 772 of file dataframe.cc.
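
Since `examples_t` is a `std::vector`, the return value can be illustrated with a plain vector. The sketch below assumes that `erase` follows the usual `std::vector::erase` convention, with `[first, last)` as a half-open range; this section does not confirm that. `example_stub` and `erase_and_peek` are hypothetical names used only for the illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Stand-in for vita::dataframe::example, reduced to an identifier.
struct example_stub
{
  int id;
};
using examples_t = std::vector<example_stub>;

// Erases `count` elements starting at index `first` and reports the id of the
// element the returned iterator refers to (-1 when it equals end()).
// Precondition: first + count <= data.size().
inline int erase_and_peek(examples_t &data, std::size_t first,
                          std::size_t count)
{
  const auto from = std::next(data.begin(), static_cast<std::ptrdiff_t>(first));
  const auto to = std::next(from, static_cast<std::ptrdiff_t>(count));

  const auto following = data.erase(from, to);  // removes [from, to)
  return following == data.end() ? -1 : following->id;
}
```

Erasing two elements at index 1 from the ids `0 1 2 3 4` leaves `0 3 4`, and the returned iterator refers to the element with id 3. When the erased range reaches the end of the container, the returned iterator equals `end()`.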

◆ front() [1/2]

dataframe::value_type & vita::dataframe::front ( )

Returns a reference to the first element in the dataframe.

Returns
a reference to the first element in the dataframe
Remarks
Calling front on an empty dataframe is undefined.

Definition at line 283 of file dataframe.cc.

◆ front() [2/2]

dataframe::value_type vita::dataframe::front ( ) const

Returns a constant reference to the first element in the dataframe.

Returns
a constant reference to the first element in the dataframe
Remarks
Calling front on an empty dataframe is undefined.

Definition at line 271 of file dataframe.cc.

◆ is_valid()

bool vita::dataframe::is_valid ( ) const
Returns
true if the object passes the internal consistency check

Definition at line 780 of file dataframe.cc.

◆ operator!()

bool vita::dataframe::operator! ( ) const
Returns
true if the current dataset is empty

Definition at line 760 of file dataframe.cc.

◆ push_back()

void vita::dataframe::push_back ( const example &  e)

Appends the given element to the end of the active dataset.

Parameters
[in]  e  the value of the element to append

Definition at line 332 of file dataframe.cc.

◆ read() [1/2]

std::size_t vita::dataframe::read ( const std::filesystem::path &  fn)

Definition at line 752 of file dataframe.cc.

◆ read() [2/2]

std::size_t vita::dataframe::read ( const std::filesystem::path &  fn,
const params &  p 
)

Loads the content of a file into the active dataset.

Parameters
[in]  fn  name of the file containing the data set (CSV / XRFF format)
[in]  p   additional, optional parameters (see params structure)
Returns
number of lines parsed
Exceptions
std::invalid_argument  missing dataset file name
Note
Test set can have an empty output value.

Definition at line 742 of file dataframe.cc.

◆ read_csv() [1/2]

std::size_t vita::dataframe::read_csv ( std::istream &  from)

Definition at line 726 of file dataframe.cc.

◆ read_csv() [2/2]

std::size_t vita::dataframe::read_csv ( std::istream &  from,
params  p 
)

Loads a CSV file into the active dataset.

Parameters
[in]  from  the CSV stream
[in]  p     additional, optional parameters (see params structure)
Returns
number of lines parsed (0 in case of errors)
Exceptions
exception::insufficient_data  empty / undersized data file

General conventions:

  • NO HEADER ROW is allowed;
  • only one example is allowed per line. A single example cannot contain newlines and cannot span multiple lines. Note that the CSV standard (e.g. http://en.wikipedia.org/wiki/Comma-separated_values) allows the newline character \n to be part of a CSV field if the field is surrounded by quotes;
  • columns are separated by commas. Commas inside a quoted string aren't column delimiters;
  • the column containing the labels (numeric or string) for the examples can be specified by the user; if not specified, the first column is the default. If the label is numeric, Vita assumes a REGRESSION model; if it's a string, a CATEGORIZATION (i.e. classification) model is assumed;
  • each column must describe the same kind of information;
  • the column order of features in the table does not weight the results. The first feature is not weighted any more than the last;
  • as a best practice, remove punctuation (other than apostrophes) from your data. This is because commas, periods and other punctuation rarely add meaning to the training data, but are treated as meaningful elements by the learning engine. For example, "end." is not matched to "end";
  • TEXT STRINGS:
    • place double quotes around all text strings;
    • text matching is case-sensitive: "wine" is different from "Wine";
    • if a string contains a double quote, the double quote must be escaped with another double quote, for example: "sentence with a ""double"" quote inside";
  • NUMERIC VALUES:
    • both integer and decimal values are supported;
    • a quoted field that contains a single number and no whitespace is treated as a number; a quoted field that contains several numeric values is treated as a string. For example, "2", "12" and "236" are numbers, while "2 12" and "a 23" are strings.
Note
Test set can have an empty output value.

Definition at line 677 of file dataframe.cc.
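
The quoting and numeric conventions listed for read_csv() can be sketched as follows. This is not the parser used by Vita: `split_csv_line` and `looks_numeric` are hypothetical helpers written for this example, and edge cases (such as surrounding whitespace or exponent notation) may be treated differently by the library.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Splits one CSV line into fields. Commas inside a quoted field are kept and
// a doubled quote ("") inside a quoted field yields a single quote character.
inline std::vector<std::string> split_csv_line(const std::string &line)
{
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;

  for (std::size_t i = 0; i < line.size(); ++i)
  {
    const char c = line[i];

    if (in_quotes)
    {
      if (c != '"')
        current += c;
      else if (i + 1 < line.size() && line[i + 1] == '"')
      {
        current += '"';  // escaped double quote
        ++i;
      }
      else
        in_quotes = false;  // closing quote
    }
    else if (c == '"')
      in_quotes = true;
    else if (c == ',')
    {
      fields.push_back(current);  // unquoted comma: column delimiter
      current.clear();
    }
    else
      current += c;
  }

  fields.push_back(current);
  return fields;
}

// True when the whole (already unquoted) field is one number without
// whitespace, e.g. "236"; false for "2 12" or "a 23".
inline bool looks_numeric(const std::string &field)
{
  if (field.empty())
    return false;
  for (unsigned char c : field)
    if (std::isspace(c))
      return false;

  try
  {
    std::size_t used = 0;
    std::stod(field, &used);
    return used == field.size();  // reject trailing non-numeric characters
  }
  catch (const std::invalid_argument &) { return false; }
  catch (const std::out_of_range &) { return false; }
}
```

Under these assumptions, a label column whose fields all satisfy `looks_numeric` would correspond to the regression case described above, and any other label column to classification.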

◆ read_xrff() [1/2]

std::size_t vita::dataframe::read_xrff ( std::istream &  in)

Definition at line 475 of file dataframe.cc.

◆ read_xrff() [2/2]

std::size_t vita::dataframe::read_xrff ( std::istream &  in,
const params &  p 
)

Loads an XRFF file from a stream into the dataframe.

Parameters
[in]  in  the XRFF stream
[in]  p   additional, optional parameters (see params structure)
Returns
number of lines parsed (0 in case of errors)
Exceptions
exception::data_format  wrong data format for data file
See also
dataframe::read_xrff(tinyxml2::XMLDocument &) for details.

Definition at line 464 of file dataframe.cc.

◆ size()

std::size_t vita::dataframe::size ( ) const
Returns
the size of the active dataset

Definition at line 291 of file dataframe.cc.

◆ variables()

unsigned vita::dataframe::variables ( ) const
Returns
input vector dimension
Note
The dataframe class supports just one output per instance, so, if the dataset is not empty, variables() + 1 == columns.size().

Definition at line 319 of file dataframe.cc.

Member Data Documentation

◆ columns

columns_info vita::dataframe::columns

Definition at line 157 of file dataframe.h.


The documentation for this class was generated from the following files: