sparkly.index package

Submodules

sparkly.index.index_base module

class sparkly.index.index_base.Index

Bases: abc.ABC

Attributes:
config

Methods

build(df, config) → None
search(doc, query_spec, limit)
search_many(docs, query_spec, limit)

config

sparkly.index.index_config module

class sparkly.index.index_config.IndexConfig(*, store_vectors=False, id_col='_id')

Bases: object

Attributes:
id_col

The unique id column for the records in the index; this must be a 32- or 64-bit integer

is_frozen

True if this index config is frozen (not modifiable), else False

store_vectors

True if the term vectors in the index should be stored, else False

Methods

add_concat_field(field, concat_fields, analyzers)  Add a new concat field to be indexed with this config
add_field(field, analyzers)  Add a new field to be indexed with this config
freeze()  Return a frozen deepcopy of this index config
from_json(data)  Construct an index config from a dict or JSON string
get_analyzed_fields([query_spec])  Get the fields used by the index or query_spec
remove_field(field)  Remove a field from the config
to_dict()  Convert this IndexConfig to a dictionary which can easily be stored as JSON
to_json()  Dump this IndexConfig to a valid JSON string
add_concat_field(field: str, concat_fields, analyzers)

Add a new concat field to be indexed with this config

Parameters:
field : str

The name of the field that will be added to the index

concat_fields : set, list, or tuple of str

The fields in the table that will be concatenated together to create field

analyzers : set, list or tuple of str

The names of the analyzers that will be used to index the field

add_field(field: str, analyzers)

Add a new field to be indexed with this config

Parameters:
field : str

The name of the field in the table to add to the index

analyzers : set, list or tuple of str

The names of the analyzers that will be used to index the field

freeze()
Returns:
IndexConfig

a frozen deepcopy of this index config

classmethod from_json(data)

Construct an index config from a dict or JSON string; see IndexConfig.to_dict for the expected format

Returns:
IndexConfig
get_analyzed_fields(query_spec=None)

Get the fields used by the index or query_spec. If query_spec is None, the fields that are used by the index are returned.

Parameters:
query_spec : QuerySpec, optional

if provided, return the fields that are used by query_spec in creating a query

Returns:
list of str

the fields used

id_col

The unique id column for the records in the index; this must be a 32- or 64-bit integer

is_frozen
Returns:
bool

True if this index config is frozen (not modifiable), else False

remove_field(field)

remove a field from the config

Parameters:
field : str

the field to be removed from the config

Returns:
bool

True if the field existed else False

store_vectors

True if the term vectors in the index should be stored, else False

to_dict()

convert this IndexConfig to a dictionary which can easily be stored as json

Returns:
dict

A dictionary representation of this IndexConfig

to_json()

Dump this IndexConfig to a valid JSON string

Returns:
str

the JSON representation of this IndexConfig

sparkly.index.lucene_index module

class sparkly.index.lucene_index.LuceneIndex(index_path)

Bases: sparkly.index.index_base.Index

Attributes:
config

the index config used to build this index

is_built

True if this index has been built else False

is_on_spark

True if this index has been distributed to the spark workers else False

query_gen

the query generator for this index

Methods

build(df, config)  build the index, indexing df according to config
deinit()  release resources held by this Index
get_full_query_spec([cross_fields])  get a query spec that uses all indexed columns
init()  initialize the index for use in a spark worker
search(doc, query_spec, limit)  perform search for doc according to query_spec, returning at most limit docs
search_many(docs, query_spec, limit)  perform search for each document in docs according to query_spec, returning at most limit docs per document
to_spark()  send this index to the spark cluster so that workers can search a local copy
id_to_lucene_id
score_docs
ANALYZERS = {
    '3gram': sparkly.analysis.Gram3Analyzer,
    'shingle': sparkly.analysis.get_shingle_analyzer,
    'standard': sparkly.analysis.get_standard_analyzer_no_stop_words,
    'standard36edgegram': sparkly.analysis.StandardEdgeGram36Analyzer,
    'standard_stopwords': org.apache.lucene.analysis.standard.StandardAnalyzer,
    'unfiltered_5gram': sparkly.analysis.UnfilteredGram5Analyzer,
}
LUCENE_DIR = 'LUCENE_INDEX'
PY_META_FILE = 'PY_META.json'
build(df, config)

build the index, indexing df according to config

Parameters:
df : pd.DataFrame or pyspark DataFrame

the table that will be indexed; if a pyspark DataFrame is provided, the build will be done in parallel for sufficiently large tables

config : IndexConfig

the config for the index being built

config

the index config used to build this index

Returns:
IndexConfig
deinit()

release resources held by this Index

get_full_query_spec(cross_fields=False)

get a query spec that uses all indexed columns

Parameters:
cross_fields : bool, default = False

if True, include <FIELD> -> <CONCAT FIELD> in the query spec when FIELD is used to create CONCAT_FIELD; otherwise return only <FIELD> -> <FIELD> and <CONCAT_FIELD> -> <CONCAT_FIELD> pairs

Returns:
QuerySpec
id_to_lucene_id(i)
init()

initialize the index for use in a spark worker. This method must be called before calling search or search_many.

is_built

True if this index has been built else False

Returns:
bool
is_on_spark

True if this index has been distributed to the spark workers else False

Returns:
bool
query_gen

the query generator for this index

Returns:
LuceneQueryGenerator
score_docs(ids, queries: dict)
search(doc, query_spec, limit)

perform search for doc according to query_spec, returning at most limit docs

Parameters:
doc : pd.Series or dict

the record to search with

query_spec : QuerySpec

the query template that specifies how to search for doc

limit : int

the maximum number of documents returned

Returns:
QueryResult

the documents matching the doc

search_many(docs, query_spec, limit)

perform search for each document in docs according to query_spec, returning at most limit docs per document

Parameters:
docs : pd.DataFrame

the records to search with

query_spec : QuerySpec

the query template that specifies how to search for each document

limit : int

the maximum number of documents returned

Returns:
pd.DataFrame

the search results for each document in docs, indexed by docs.index

to_spark()

send this index to the spark cluster. Subsequent uses will read files from SparkFiles, allowing spark workers to perform search with a local copy of the index.

Module contents