sparkly.index package¶
Submodules¶
sparkly.index.index_base module¶
sparkly.index.index_config module¶
-
class
sparkly.index.index_config.IndexConfig(*, store_vectors=False, id_col='_id')¶ Bases:
object
Attributes:
- id_col: The unique id column for the records in the index; this must be a 32 or 64 bit integer.
- is_frozen: True if this index config is frozen (not modifiable), else False.
- store_vectors: True if the term vectors in the index should be stored, else False.
Methods:
- add_concat_field(field, concat_fields, analyzers): Add a new concat field to be indexed with this config.
- add_field(field, analyzers): Add a new field to be indexed with this config.
- freeze(): Return a frozen deepcopy of this index config.
- from_json(data): Construct an index config from a dict or JSON string.
- get_analyzed_fields([query_spec]): Get the fields used by the index or query_spec.
- remove_field(field): Remove a field from the config.
- to_dict(): Convert this IndexConfig to a dictionary that can easily be stored as JSON.
- to_json(): Dump this IndexConfig to a valid JSON string.
-
add_concat_field(field: str, concat_fields, analyzers)¶ Add a new concat field to be indexed with this config.
Parameters: - field : str
The name of the field that will be added to the index
- concat_fields : set, list, or tuple of str
The fields in the table that will be concatenated together to create field
- analyzers : set, list, or tuple of str
The names of the analyzers that will be used to index the field
-
add_field(field: str, analyzers)¶ Add a new field to be indexed with this config.
Parameters: - field : str
The name of the field in the table to be added to the index
- analyzers : set, list, or tuple of str
The names of the analyzers that will be used to index the field
-
freeze()¶ Returns: - IndexConfig
a frozen deepcopy of this index config
-
classmethod
from_json(data)¶ Construct an index config from a dict or JSON string; see IndexConfig.to_dict for the expected format
Returns: - IndexConfig
-
get_analyzed_fields(query_spec=None)¶ Get the fields used by the index or query_spec. If query_spec is None, the fields that are used by the index are returned.
Parameters: - query_spec : QuerySpec, optional
If provided, the fields used by query_spec to create a query are returned instead
Returns: - list of str
the fields used
-
id_col¶ The unique id column for the records in the index; this must be a 32 or 64 bit integer
-
is_frozen¶ Returns: - bool
True if this index config is frozen (not modifiable), else False
-
remove_field(field)¶ Remove a field from the config
Parameters: - field : str
The field to be removed from the config
Returns: - bool
True if the field existed, else False
-
store_vectors¶ True if the term vectors in the index should be stored, else False
-
to_dict()¶ Convert this IndexConfig to a dictionary which can easily be stored as JSON
Returns: - dict
A dictionary representation of this IndexConfig
-
to_json()¶ Dump this IndexConfig to a valid JSON string
Returns: - str
sparkly.index.lucene_index module¶
-
class
sparkly.index.lucene_index.LuceneIndex(index_path)¶ Bases:
sparkly.index.index_base.Index
Attributes:
- config: The index config used to build this index.
- is_built: True if this index has been built, else False.
- is_on_spark: True if this index has been distributed to the spark workers, else False.
- query_gen: The query generator for this index.
Methods:
- build(df, config): Build the index, indexing df according to config.
- deinit(): Release resources held by this Index.
- get_full_query_spec([cross_fields]): Get a query spec that uses all indexed columns.
- init(): Initialize the index for usage in a spark worker; must be called before search or search_many.
- search(doc, query_spec, limit): Perform a search for doc according to query_spec, returning at most limit docs.
- search_many(docs, query_spec, limit): Perform a search for each document in docs according to query_spec, returning at most limit docs per document.
- to_spark(): Send this index to the spark cluster; subsequent uses will read files from SparkFiles.
- id_to_lucene_id(i)
- score_docs(ids, queries)
-
ANALYZERS= {'3gram': <class 'sparkly.analysis.Gram3Analyzer'>, 'shingle': <function get_shingle_analyzer>, 'standard': <function get_standard_analyzer_no_stop_words>, 'standard36edgegram': <class 'sparkly.analysis.StandardEdgeGram36Analyzer'>, 'standard_stopwords': <class 'org.apache.lucene.analysis.standard.StandardAnalyzer'>, 'unfiltered_5gram': <class 'sparkly.analysis.UnfilteredGram5Analyzer'>}¶
-
LUCENE_DIR= 'LUCENE_INDEX'¶
-
PY_META_FILE= 'PY_META.json'¶
-
build(df, config)¶ Build the index, indexing df according to config
Parameters: - df : pd.DataFrame or pyspark DataFrame
The table that will be indexed. If a pyspark DataFrame is provided, the build will be done in parallel for sufficiently large tables
- config : IndexConfig
The config for the index being built
-
config¶ The index config used to build this index
Returns: - IndexConfig
-
deinit()¶ Release resources held by this Index
-
get_full_query_spec(cross_fields=False)¶ Get a query spec that uses all indexed columns
Parameters: - cross_fields : bool, default = False
If True, also return <FIELD> -> <CONCAT_FIELD> pairs in the query spec when FIELD is used to create CONCAT_FIELD; otherwise return only <FIELD> -> <FIELD> and <CONCAT_FIELD> -> <CONCAT_FIELD> pairs
Returns: - QuerySpec
-
id_to_lucene_id(i)¶
-
init()¶ Initialize the index for usage in a spark worker. This method must be called before calling search or search_many.
-
is_built¶ True if this index has been built else False
Returns: - bool
-
is_on_spark¶ True if this index has been distributed to the spark workers else False
Returns: - bool
-
query_gen¶ the query generator for this index
Returns: - LuceneQueryGenerator
-
score_docs(ids, queries: dict)¶
-
search(doc, query_spec, limit)¶ Perform a search for doc according to query_spec, returning at most limit docs
Parameters: - doc : pd.Series or dict
The record to search for
- query_spec : QuerySpec
The query template that specifies how to search for doc
- limit : int
The maximum number of documents returned
Returns: - QueryResult
The documents matching doc
-
search_many(docs, query_spec, limit)¶ Perform a search for each document in docs according to query_spec, returning at most limit docs per document.
Parameters: - docs : pd.DataFrame
The records to search for
- query_spec : QuerySpec
The query template that specifies how to search for each document
- limit : int
The maximum number of documents returned per document
Returns: - pd.DataFrame
The search results for each document in docs, indexed by docs.index
-
to_spark()¶ Send this index to the spark cluster. Subsequent uses will read files from SparkFiles, allowing spark workers to perform search with a local copy of the index.