sparkly.index_optimizer package
Submodules
sparkly.index_optimizer.index_optimizer module
class sparkly.index_optimizer.index_optimizer.IndexOptimizer(is_dedupe, scorer=None, conf=0.99)
Bases: object
A class for optimizing the search columns and analyzers for indexes.
Attributes: - index
Methods: - make_index_config(df[, id_col])
create the starting index config, which can then be used for optimization
- optimize(index, search_df)
make_index_config(df, id_col='_id') → sparkly.index.index_config.IndexConfig
Create the starting index config, which can then be used for optimization. Columns whose average number of whitespace-delimited tokens is >= 50 are thrown out.
Parameters: - df : pyspark.sql.DataFrame
the dataframe that we want to generate a config for
- id_col : str
the unique id column for the records in the dataframe
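The column filter described for make_index_config can be illustrated in plain Python. This is only a sketch of the stated rule (drop a column when its average whitespace-delimited token count is >= 50), not the library's implementation: the helper names `avg_token_count` and `keep_columns` are hypothetical, and the rows are modeled as a list of dicts rather than a pyspark.sql.DataFrame.

```python
from statistics import mean

MAX_AVG_TOKENS = 50  # cutoff stated in the description above


def avg_token_count(values):
    """Average number of whitespace-delimited tokens across the given strings."""
    counts = [len(v.split()) for v in values]
    return mean(counts) if counts else 0.0


def keep_columns(rows, id_col="_id"):
    """Return the non-id columns whose average token count is below the cutoff.

    Hypothetical helper: rows is a list of dicts standing in for a dataframe.
    """
    if not rows:
        return []
    columns = [c for c in rows[0] if c != id_col]
    return [
        c for c in columns
        if avg_token_count([str(r[c]) for r in rows]) < MAX_AVG_TOKENS
    ]


rows = [
    {"_id": 1, "title": "a b c", "body": " ".join(["w"] * 60)},
    {"_id": 2, "title": "d e", "body": " ".join(["w"] * 40)},
]
print(keep_columns(rows))
```

Here the body column averages exactly 50 tokens, so it is dropped, while title is kept.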
optimize(index: sparkly.index.index_base.Index, search_df) → sparkly.query_generator.query_spec.QuerySpec
Parameters: - index : Index
the index that will have an optimized query spec created for it
- search_df : pyspark.sql.DataFrame
the records that will be used to choose the query spec
the records that will be used to choose the query spec
Returns: - QuerySpec
a query spec optimized for searching index with the records in search_df
sparkly.index_optimizer.query_scorer module
class sparkly.index_optimizer.query_scorer.AUCQueryScorer
Bases: sparkly.index_optimizer.query_scorer.QueryScorer
Methods: - score_query_result(query_result, query_spec, drop_first) → float
- score_query_results(query_results, query_spec, drop_first) → list
class sparkly.index_optimizer.query_scorer.QueryScorer
Bases: abc.ABC
Methods: - score_query_result(query_result, query_spec) → float
- score_query_results(query_results, query_spec) → list
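The abstract scorer interface listed above can be pictured with a standalone sketch. It does not import sparkly, and several things are assumptions rather than documented behavior: that score_query_result is the abstract method, that score_query_results simply applies it to each result, and that a query result can be modeled as a list of float scores. The class names `SketchQueryScorer` and `TopScoreScorer` are hypothetical.

```python
from abc import ABC, abstractmethod


class SketchQueryScorer(ABC):
    """Stand-in for an abstract scorer with the two documented method names."""

    @abstractmethod
    def score_query_result(self, query_result, query_spec) -> float:
        """Score a single query result."""

    def score_query_results(self, query_results, query_spec) -> list:
        # Assumption: the plural method maps the singular one over its input.
        return [self.score_query_result(r, query_spec) for r in query_results]


class TopScoreScorer(SketchQueryScorer):
    """Hypothetical scorer: rates a result by its highest score."""

    def score_query_result(self, query_result, query_spec) -> float:
        # Assumption: query_result is a plain list of float scores.
        return float(max(query_result, default=0.0))


scorer = TopScoreScorer()
print(scorer.score_query_results([[1.0, 3.5], []], query_spec=None))
```

Deriving from abc.ABC means a subclass that does not implement the abstract method cannot be instantiated, which is the usual reason for declaring a scorer base class this way.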
class sparkly.index_optimizer.query_scorer.RankQueryScorer(threshold, k)
Bases: sparkly.index_optimizer.query_scorer.QueryScorer
Methods: - score_query_result(query_result, query_spec) → float
- score_query_results(query_results, query_spec) → list
sparkly.index_optimizer.query_scorer.compute_wilcoxon_score(x, y)
sparkly.index_optimizer.query_scorer.score_query_result(scores, drop_first=False)
sparkly.index_optimizer.query_scorer.score_query_result_sum(scores)
sparkly.index_optimizer.query_scorer.score_query_results(query_results)