sparkly.index package

Submodules

sparkly.index.index_base module

class sparkly.index.index_base.Index

Bases: abc.ABC

Attributes:
config

Methods

build(df, config) → None
search(doc, query_spec, limit)
search_many(docs, query_spec, limit)

config

sparkly.index.index_config module

class sparkly.index.index_config.IndexConfig(*, store_vectors=False, id_col='_id')

Bases: object

Attributes:
id_col

The unique id column for the records in the index; this must be a 32- or 64-bit integer

is_frozen

True if this index config is frozen (not modifiable), else False

store_vectors

True if the term vectors in the index should be stored, else False

Methods

add_concat_field(field, concat_fields, analyzers)  Add a new concat field to be indexed with this config
add_field(field, analyzers)  Add a new field to be indexed with this config
freeze()  Return a frozen deepcopy of this index config
from_json(data)  Construct an index config from a dict or JSON string
get_analyzed_fields([query_spec])  Get the fields used by the index or query_spec
remove_field(field)  Remove a field from the config
to_dict()  Convert this IndexConfig to a dictionary which can easily be stored as JSON
to_json()  Dump this IndexConfig to a valid JSON string
add_concat_field(field: str, concat_fields, analyzers)

Add a new concat field to be indexed with this config

Parameters:
field : str

The name of the field that will be added to the index

concat_fields : set, list, or tuple of str

The fields in the table that will be concatenated together to create field

analyzers : set, list or tuple of str

The names of the analyzers that will be used to index the field

add_field(field: str, analyzers)

Add a new field to be indexed with this config

Parameters:
field : str

The name of the field in the table to add to the index

analyzers : set, list or tuple of str

The names of the analyzers that will be used to index the field

freeze()
Returns:
IndexConfig

a frozen deepcopy of this index config

classmethod from_json(data)

Construct an index config from a dict or JSON string; see IndexConfig.to_dict for the expected format

Returns:
IndexConfig
get_analyzed_fields(query_spec=None)

Get the fields used by the index or query_spec. If query_spec is None, the fields that are used by the index are returned.

Parameters:
query_spec : QuerySpec, optional

if provided, return the fields that are used by query_spec in creating a query

Returns:
list of str

the fields used

id_col

The unique id column for the records in the index; this must be a 32- or 64-bit integer

is_frozen
Returns:
bool

True if this index config is frozen (not modifiable), else False

remove_field(field)

remove a field from the config

Parameters:
field : str

the field to be removed from the config

Returns:
bool

True if the field existed else False

store_vectors

True if the term vectors in the index should be stored, else False

to_dict()

convert this IndexConfig to a dictionary which can easily be stored as json

Returns:
dict

A dictionary representation of this IndexConfig

to_json()

Dump this IndexConfig to a valid JSON string

Returns:
str

the JSON representation of this IndexConfig

sparkly.index.lucene_index module

class sparkly.index.lucene_index.LuceneIndex(index_path)

Bases: sparkly.index.index_base.Index

Attributes:
config

the index config used to build this index

is_built

True if this index has been built else False

is_on_spark

True if this index has been distributed to the spark workers else False

query_gen

the query generator for this index

Methods

build(df, config)  build the index, indexing df according to config
deinit()  release resources held by this Index
get_full_query_spec([cross_fields])  get a query spec that uses all indexed columns
init()  initialize the index for use in a spark worker
search(doc, query_spec, limit)  perform search for doc according to query_spec, returning at most limit docs
search_many(docs, query_spec, limit)  perform search for each document in docs according to query_spec, returning at most limit docs per document
to_spark()  send this index to the spark cluster so that workers can search a local copy
id_to_lucene_id
score_docs
ANALYZERS = {
    '3gram': sparkly.analysis.Gram3Analyzer,
    'shingle': sparkly.analysis.get_shingle_analyzer,
    'standard': sparkly.analysis.get_standard_analyzer_no_stop_words,
    'standard36edgegram': sparkly.analysis.StandardEdgeGram36Analyzer,
    'standard_stopwords': org.apache.lucene.analysis.standard.StandardAnalyzer,
    'unfiltered_5gram': sparkly.analysis.UnfilteredGram5Analyzer,
}
LUCENE_DIR = 'LUCENE_INDEX'
PY_META_FILE = 'PY_META.json'
build(df, config)

build the index, indexing df according to config

Parameters:
df : pd.DataFrame or pyspark DataFrame

the table that will be indexed; if a pyspark DataFrame is provided, the build will be done in parallel for sufficiently large tables

config : IndexConfig

the config for the index being built

config

the index config used to build this index

Returns:
IndexConfig
deinit()

release resources held by this Index

get_full_query_spec(cross_fields=False)

get a query spec that uses all indexed columns

Parameters:
cross_fields : bool, default = False

if True, include <FIELD> -> <CONCAT FIELD> in the query spec when FIELD is used to create CONCAT_FIELD; otherwise return only <FIELD> -> <FIELD> and <CONCAT_FIELD> -> <CONCAT_FIELD> pairs

Returns:
QuerySpec
id_to_lucene_id(i)
init()

initialize the index for use in a spark worker. This method must be called before calling search or search_many.

is_built

True if this index has been built else False

Returns:
bool
is_on_spark

True if this index has been distributed to the spark workers else False

Returns:
bool
query_gen

the query generator for this index

Returns:
LuceneQueryGenerator
score_docs(ids, queries: dict)
search(doc, query_spec, limit)

perform search for doc according to query_spec, returning at most limit docs

Parameters:
doc : pd.Series or dict

the record to search with

query_spec : QuerySpec

the query template that specifies how to search for doc

limit : int

the maximum number of documents returned

Returns:
QueryResult

the documents matching the doc

search_many(docs, query_spec, limit)

perform search for each document in docs according to query_spec, returning at most limit docs per document

Parameters:
docs : pd.DataFrame

the records to search with

query_spec : QuerySpec

the query template that specifies how to search for each document

limit : int

the maximum number of documents returned

Returns:
pd.DataFrame

the search results for each document in docs, indexed by docs.index

to_spark()

send this index to the spark cluster. Subsequent uses will read files from SparkFiles, allowing spark workers to perform search with a local copy of the index.

Module contents