libcudf
24.02.00
Files
file | partition.hpp | Strings partition APIs. |
file | split.hpp |
file | split_re.hpp |
std::unique_ptr<table> cudf::strings::partition(
    strings_column_view const& input,
    string_scalar const& delimiter = string_scalar(""),
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns a set of 3 columns by splitting each string using the specified delimiter.
The number of rows in the output columns will be the same as the input column. The first column will contain the first tokens of each string as a result of the split. The second column will contain the delimiter. The third column will contain the remaining characters of each string after the delimiter.
A null input row produces a null entry in that row of every output column.
input | Strings instance for this operation |
delimiter | UTF-8 encoded string indicating where to split each string. Default of empty string indicates split on whitespace. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table's device memory |
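The three-column layout can be pictured with a small host-side model. The sketch below is plain standard C++, not libcudf code: `partition_row` is a hypothetical helper that mimics one output row for a non-empty delimiter. Its handling of a string with no delimiter (whole string in the first column, empty strings in the other two) is an assumption borrowed from Python's `str.partition`, not something stated above.

```cpp
#include <array>
#include <string>

// Host-side model of one partition() output row; NOT libcudf code.
// Returns {before, delimiter, after}. Assumes `delim` is non-empty.
std::array<std::string, 3> partition_row(std::string const& s, std::string const& delim)
{
  auto const pos = s.find(delim);  // first occurrence of the delimiter
  if (pos == std::string::npos) {
    // Assumption (mirrors Python's str.partition): no delimiter found, so the
    // whole string goes to the first column and the other two are empty.
    return {s, "", ""};
  }
  return {s.substr(0, pos), delim, s.substr(pos + delim.size())};
}
```

Under this model, `partition_row("def_g_h", "_")` yields `"def"`, `"_"`, and `"g_h"`: only the first delimiter splits the string.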
std::unique_ptr<table> cudf::strings::rpartition(
    strings_column_view const& input,
    string_scalar const& delimiter = string_scalar(""),
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns a set of 3 columns by splitting each string using the specified delimiter starting from the end of each string.
The number of rows in the output columns will be the same as the input column. The first column will contain the characters of each string before the last delimiter found. The second column will contain the delimiter. The third column will contain the remaining characters of each string after the delimiter.
A null input row produces a null entry in that row of every output column.
input | Strings instance for this operation |
delimiter | UTF-8 encoded string indicating where to split each string. Default of empty string indicates split on whitespace. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table's device memory |
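The difference from partition is only which delimiter occurrence is used. As a host-side illustration in standard C++ (not libcudf code), the hypothetical helper `rpartition_row` below searches from the end of the string. Placing an unsplit string in the third column is an assumption borrowed from Python's `str.rpartition`.

```cpp
#include <array>
#include <string>

// Host-side model of one rpartition() output row; NOT libcudf code.
// Returns {before, delimiter, after}. Assumes `delim` is non-empty.
std::array<std::string, 3> rpartition_row(std::string const& s, std::string const& delim)
{
  auto const pos = s.rfind(delim);  // last occurrence of the delimiter
  if (pos == std::string::npos) {
    // Assumption (mirrors Python's str.rpartition): no delimiter found, so the
    // whole string goes to the third column and the other two are empty.
    return {"", "", s};
  }
  return {s.substr(0, pos), delim, s.substr(pos + delim.size())};
}
```

Here `rpartition_row("def_g_h", "_")` yields `"def_g"`, `"_"`, and `"h"`, because the last delimiter is the split point.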
std::unique_ptr<table> cudf::strings::rsplit(
    strings_column_view const& strings_column,
    string_scalar const& delimiter = string_scalar(""),
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns a table of columns by splitting each string using the specified delimiter, starting from the end of each string.
The number of rows in the output columns will be the same as the input column. The first column will contain the first tokens encountered in each string as a result of the split. Subsequent columns contain the next token strings. Null entries are added for a row where split results have been exhausted. The total number of columns will equal the maximum number of tokens produced for any string in the input column.
A null input row produces a null entry in that row of every output column.
strings_column | Strings instance for this operation |
delimiter | UTF-8 encoded string indicating the split points in each string. Default of empty string indicates split on whitespace. |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table's device memory |
std::unique_ptr<table> cudf::strings::rsplit_re(
    strings_column_view const& input,
    regex_program const& prog,
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits strings elements into a table of strings columns using a regex_program's pattern to delimit each string starting from the end of the string.
Each element generates a vector of strings that are stored in corresponding rows in the output table: table[col,row] = token[col] of string[row], where token is the substring between each delimiter.
The number of rows in the output table will be the same as the number of elements in the input column. The resulting number of columns will be the maximum number of tokens found in any input row.
Splitting traverses each input string starting from its end. The pattern is used to identify the delimiters within a string, and splitting stops when either maxsplit or the beginning of the string is reached.
An empty input string will produce an empty string in the same row of the first column. A null row will produce corresponding null rows in the output table.
The regex_program's regex_flags are ignored.
cudf::logic_error | if pattern is empty. |
input | A column of string elements to be split |
prog | Regex program instance |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |
std::unique_ptr<column> cudf::strings::rsplit_record(
    strings_column_view const& strings,
    string_scalar const& delimiter = string_scalar(""),
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits individual strings elements into a list of strings starting from the end of each string.
Each element generates an array of strings that are stored in an output lists column.
The number of elements in the output column will be the same as the number of elements in the input column. Each individual list item will contain the new strings for that row. The resulting number of strings in each row can vary from 0 to maxsplit + 1.
The delimiter is searched from end to beginning within each string and splitting stops when either maxsplit or the beginning of the string is reached.
If a delimiter is not whitespace and occurs adjacent to another delimiter, an empty string is produced for that split occurrence. Likewise, a non-whitespace delimiter produces an empty string if it appears at the beginning or the end of a string.
Note that rsplit_record and split_record produce equivalent results for the default maxsplit value.
A whitespace delimiter produces no empty strings.
A null string element will result in a null list item for that row.
cudf::logic_error | if delimiter is invalid. |
strings | A column of string elements to be split |
delimiter | The string to identify split points in each string. Default of empty string indicates split on whitespace. |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |
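The effect of searching from the end only becomes visible once maxsplit limits the number of splits. The following host-side sketch in standard C++ (not libcudf code) models one row for a non-empty, non-whitespace delimiter. `rsplit_row` is a hypothetical helper, and the model assumes the resulting tokens are reported in left-to-right order.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Host-side model of one rsplit_record() row; NOT libcudf code.
// Assumes `delim` is non-empty. maxsplit < 0 means "no limit".
std::vector<std::string> rsplit_row(std::string const& s,
                                    std::string const& delim,
                                    int maxsplit = -1)
{
  std::vector<std::string> tokens;  // collected right-to-left, reversed below
  std::size_t end = s.size();       // one past the unsplit prefix
  int splits      = 0;
  while ((maxsplit < 0 || splits < maxsplit) && end >= delim.size()) {
    // Last delimiter that lies entirely inside [0, end).
    auto const pos = s.rfind(delim, end - delim.size());
    if (pos == std::string::npos) { break; }
    tokens.push_back(s.substr(pos + delim.size(), end - pos - delim.size()));
    end = pos;
    ++splits;
  }
  tokens.push_back(s.substr(0, end));  // whatever remains at the front
  std::reverse(tokens.begin(), tokens.end());
  return tokens;
}
```

With this model, `rsplit_row("a_bc_def_g", "_", 1)` gives `["a_bc_def", "g"]`, while the default maxsplit gives `["a", "bc", "def", "g"]`, the same tokens a left-to-right split would produce for this input.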
std::unique_ptr<column> cudf::strings::rsplit_record_re(
    strings_column_view const& input,
    regex_program const& prog,
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits strings elements into a list column of strings using the given regex_program to delimit each string starting from the end of the string.
Each element generates a vector of strings that are stored in an output lists column: list[row] = [token1, token2, ...] found in input[row], where token is a substring between delimiters.
The number of elements in the output column will be the same as the number of elements in the input column. Each individual list item will contain the new strings for that row. The resulting number of strings in each row can vary from 0 to maxsplit + 1.
Splitting traverses each input string starting from its end. The pattern is used to identify the separation points within a string, and splitting stops when either maxsplit or the beginning of the string is reached.
An empty input string will produce a corresponding empty list item output row. A null row will produce a corresponding null output row.
The regex_program's regex_flags are ignored.
See the Regex Features page for details on patterns supported by this API.
cudf::logic_error | if pattern is empty. |
input | A column of string elements to be split |
prog | Regex program instance |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |
std::unique_ptr<table> cudf::strings::split(
    strings_column_view const& strings_column,
    string_scalar const& delimiter = string_scalar(""),
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Returns a table of columns by splitting each string using the specified delimiter.
The number of rows in the output columns will be the same as the input column. The first column will contain the first tokens of each string as a result of the split. Subsequent columns contain the next token strings. Null entries are added for a row where split results have been exhausted. The total number of columns will equal the maximum number of tokens produced for any string in the input column.
A null input row produces a null entry in that row of every output column.
strings_column | Strings instance for this operation |
delimiter | UTF-8 encoded string indicating the split points in each string. Default of empty string indicates split on whitespace. |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table's device memory |
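The column-major output shape, including the null padding for rows that run out of tokens, can be modelled on the host. The sketch below is standard C++17, not libcudf code: `cell`, `column`, `tokenize`, and `split_to_columns` are hypothetical names, `std::nullopt` stands in for a null entry, and only a non-empty delimiter with the default maxsplit is modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using cell   = std::optional<std::string>;  // std::nullopt models a null entry
using column = std::vector<cell>;

// Tokenize one string on a non-empty delimiter (all possible splits).
std::vector<std::string> tokenize(std::string const& s, std::string const& delim)
{
  std::vector<std::string> tokens;
  std::size_t begin = 0;
  for (auto pos = s.find(delim); pos != std::string::npos; pos = s.find(delim, begin)) {
    tokens.push_back(s.substr(begin, pos - begin));
    begin = pos + delim.size();
  }
  tokens.push_back(s.substr(begin));
  return tokens;
}

// Host-side model of the split() result: columns[c][r] is token c of row r.
std::vector<column> split_to_columns(column const& input, std::string const& delim)
{
  std::vector<std::vector<std::string>> rows;
  std::size_t ncols = 0;
  for (auto const& s : input) {
    // A null input row contributes no tokens, so all of its cells stay null.
    rows.push_back(s ? tokenize(*s, delim) : std::vector<std::string>{});
    ncols = std::max(ncols, rows.back().size());
  }
  std::vector<column> columns(ncols, column(input.size()));  // every cell starts null
  for (std::size_t r = 0; r < rows.size(); ++r) {
    for (std::size_t c = 0; c < rows[r].size(); ++c) { columns[c][r] = rows[r][c]; }
  }
  return columns;
}
```

For the input rows `"a_b_c"`, `"d_e"`, and null, the model returns three columns: the first holds `"a"`, `"d"`, null; the second `"b"`, `"e"`, null; the third `"c"`, null, null.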
std::unique_ptr<table> cudf::strings::split_re(
    strings_column_view const& input,
    regex_program const& prog,
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits strings elements into a table of strings columns using a regex_program's pattern to delimit each string.
Each element generates a vector of strings that are stored in corresponding rows in the output table: table[col,row] = token[col] of strings[row], where token is a substring between delimiters.
The number of rows in the output table will be the same as the number of elements in the input column. The resulting number of columns will be the maximum number of tokens found in any input row.
The pattern is used to identify the delimiters within a string, and splitting stops when either maxsplit or the end of the string is reached.
An empty input string will produce an empty string in the same row of the first column. A null row will produce corresponding null rows in the output table.
The regex_program's regex_flags are ignored.
cudf::logic_error | if pattern is empty. |
input | A column of string elements to be split |
prog | Regex program instance |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |
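A rough host-side analogue of the regex-delimited split for a single row can be written with `std::regex`. This is an illustration only: `std::regex` uses ECMAScript syntax and is not the libcudf regex engine, `split_re_row` is a hypothetical helper, and `std::invalid_argument` merely stands in for the documented cudf::logic_error on an empty pattern. The sketch assumes the pattern never matches an empty string.

```cpp
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// Host-side model of one split_re() row; NOT libcudf code.
// maxsplit < 0 means "no limit".
std::vector<std::string> split_re_row(std::string const& s,
                                      std::string const& pattern,
                                      int maxsplit = -1)
{
  if (pattern.empty()) { throw std::invalid_argument("pattern must not be empty"); }
  std::regex const re(pattern);
  std::vector<std::string> tokens;
  auto begin = s.cbegin();  // start of the token currently being built
  int splits = 0;
  for (std::sregex_iterator it(s.cbegin(), s.cend(), re), last; it != last; ++it) {
    if (maxsplit >= 0 && splits == maxsplit) { break; }
    auto const& m = *it;
    tokens.emplace_back(begin, m[0].first);  // text before this delimiter match
    begin = m[0].second;                     // resume after the match
    ++splits;
  }
  tokens.emplace_back(begin, s.cend());  // remainder after the last split
  return tokens;
}
```

With the pattern `[_ ]`, the string `"a_bc def_g"` splits into `["a", "bc", "def", "g"]`, and with a maxsplit of 1 into `["a", "bc def_g"]`.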
std::unique_ptr<column> cudf::strings::split_record(
    strings_column_view const& strings,
    string_scalar const& delimiter = string_scalar(""),
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits individual strings elements into a list of strings.
Each element generates an array of strings that are stored in an output lists column.
The number of elements in the output column will be the same as the number of elements in the input column. Each individual list item will contain the new strings for that row. The resulting number of strings in each row can vary from 0 to maxsplit + 1.
The delimiter is searched within each string from beginning to end and splitting stops when either maxsplit or the end of the string is reached.
If a delimiter is not whitespace and occurs adjacent to another delimiter, an empty string is produced for that split occurrence. Likewise, a non-whitespace delimiter produces an empty string if it appears at the beginning or the end of a string.
A whitespace delimiter produces no empty strings.
A null string element will result in a null list item for that row.
cudf::logic_error | if delimiter is invalid. |
strings | A column of string elements to be split |
delimiter | The string to identify split points in each string. Default of empty string indicates split on whitespace. |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |
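The rules for empty strings and for maxsplit are easiest to check against a concrete model. Below is a host-side sketch in standard C++, not libcudf code. `split_row` and `split_whitespace_row` are hypothetical helpers modelling one row. As a simplification, maxsplit is not modelled for the whitespace case.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Whitespace delimiter: runs of whitespace separate tokens, so no empty
// strings are produced.
std::vector<std::string> split_whitespace_row(std::string const& s)
{
  std::vector<std::string> tokens;
  std::istringstream in(s);
  for (std::string tok; in >> tok;) { tokens.push_back(tok); }
  return tokens;
}

// Host-side model of one split_record() row; NOT libcudf code.
// An empty `delim` selects whitespace mode; maxsplit < 0 means "no limit".
std::vector<std::string> split_row(std::string const& s,
                                   std::string const& delim,
                                   int maxsplit = -1)
{
  // Simplification: maxsplit is ignored in whitespace mode.
  if (delim.empty()) { return split_whitespace_row(s); }

  std::vector<std::string> tokens;
  std::size_t begin = 0;
  int splits        = 0;
  while (maxsplit < 0 || splits < maxsplit) {
    auto const pos = s.find(delim, begin);
    if (pos == std::string::npos) { break; }
    // Adjacent or leading delimiters give pos == begin, i.e. an empty token.
    tokens.push_back(s.substr(begin, pos - begin));
    begin = pos + delim.size();
    ++splits;
  }
  tokens.push_back(s.substr(begin));  // remainder; empty after a trailing delimiter
  return tokens;
}
```

In this model `split_row("a__bc", "_")` gives `["a", "", "bc"]`, `split_row("a_bc_def_g", "_", 1)` gives `["a", "bc_def_g"]`, and `split_row("  a  b ", "")` gives `["a", "b"]`.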
std::unique_ptr<column> cudf::strings::split_record_re(
    strings_column_view const& input,
    regex_program const& prog,
    size_type maxsplit = -1,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Splits strings elements into a list column of strings using the given regex_program to delimit each string.
Each element generates an array of strings that are stored in an output lists column: list[row] = [token1, token2, ...] found in input[row], where token is a substring between delimiters.
The number of elements in the output column will be the same as the number of elements in the input column. Each individual list item will contain the new strings for that row. The resulting number of strings in each row can vary from 0 to maxsplit + 1.
The pattern is used to identify the delimiters within a string and splitting stops when either maxsplit or the end of the string is reached.
An empty input string will produce a corresponding empty list item output row. A null row will produce a corresponding null output row.
The regex_program's regex_flags are ignored.
cudf::logic_error | if pattern is empty. |
See the Regex Features page for details on patterns supported by this API.
input | A column of string elements to be split |
prog | Regex program instance |
maxsplit | Maximum number of splits to perform. Default of -1 indicates all possible splits on each string. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned result's device memory |