sparkly.index_optimizer package¶

Submodules¶

sparkly.index_optimizer.index_optimizer module¶

class sparkly.index_optimizer.index_optimizer.IndexOptimizer(is_dedupe, scorer=None, conf=0.99)¶

Bases: object

a class for optimizing the search columns and analyzers for indexes

Attributes:	index

Methods

make_index_config(df[, id_col]) create the starting index config, which can then be used for optimization

optimize(index, search_df)

Parameters:

index¶

make_index_config(df, id_col='_id') → sparkly.index.index_config.IndexConfig¶

create the starting index config, which can then be used for optimization; throws out any columns where the average number of whitespace-delimited tokens is >= 50

Parameters:
df : pyspark.sql.DataFrame
    the dataframe that we want to generate a config for
id_col : str
    the unique id column for the records in the dataframe
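
The column filter described for make_index_config can be illustrated without Spark. The following is a minimal sketch in plain Python, not sparkly's implementation: the helper names (avg_token_count, candidate_columns), the list-of-dicts input, and the choice to skip null values when averaging are all assumptions made for illustration. Only the threshold of 50 and the whitespace tokenization come from the description above.

```python
from statistics import mean

# Threshold taken from the make_index_config description: columns whose
# average whitespace-delimited token count is >= 50 are thrown out.
MAX_AVG_TOKENS = 50


def avg_token_count(values):
    """Average number of whitespace-delimited tokens over non-null values.

    Skipping nulls is an assumption of this sketch.
    """
    counts = [len(str(v).split()) for v in values if v is not None]
    return mean(counts) if counts else 0.0


def candidate_columns(rows, id_col="_id"):
    """Return the non-id columns whose average token count is below the threshold."""
    columns = [c for c in rows[0] if c != id_col]
    return [
        c
        for c in columns
        if avg_token_count([r.get(c) for r in rows]) < MAX_AVG_TOKENS
    ]


# Hypothetical records: 'description' averages 70 tokens, 'title' averages 3.
rows = [
    {"_id": 1, "title": "red running shoes", "description": "word " * 80},
    {"_id": 2, "title": "blue trail shoes", "description": "word " * 60},
]
kept = candidate_columns(rows)
print(kept)
```

With these sample rows only the short title column survives, which matches the intent of excluding long free-text columns from the starting config.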

optimize(index: sparkly.index.index_base.Index, search_df) → sparkly.query_generator.query_spec.QuerySpec¶

Parameters:
index : Index
    the index that will have an optimized query spec created for it
search_df : pyspark.sql.DataFrame
    the records that will be used to choose the query spec
Returns:
QuerySpec
    a query spec optimized for searching for search_df using index

sparkly.index_optimizer.query_scorer module¶

class sparkly.index_optimizer.query_scorer.AUCQueryScorer¶

Bases: sparkly.index_optimizer.query_scorer.QueryScorer

Methods

score_query_result
score_query_results

score_query_result(query_result, query_spec, drop_first) → float¶

score_query_results(query_results, query_spec, drop_first) → list¶

class sparkly.index_optimizer.query_scorer.QueryScorer¶

Bases: abc.ABC

Methods

score_query_results(query_results, query_spec)

score_query_result

score_query_result(query_result, query_spec) → float¶

score_query_results(query_results, query_spec) → list¶
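
The two-method shape of the scorer interface can be sketched as follows. This is a hedged illustration only: ToyQueryScorer, ToyQueryResult, and TopGapScorer are hypothetical stand-ins that do not exist in sparkly, the `scores` field is an assumed result layout, and the default batch implementation is a design choice of this sketch rather than documented behavior. Only the method names and their (query_result, query_spec) arguments mirror the signatures listed above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ToyQueryResult:
    """Hypothetical stand-in for a query result: just the hit scores."""
    scores: List[float]


class ToyQueryScorer(ABC):
    """Stand-in base class mirroring the two documented method names."""

    @abstractmethod
    def score_query_result(self, query_result, query_spec) -> float:
        """Score a single query result."""

    def score_query_results(self, query_results, query_spec) -> list:
        # Assumption of this sketch: the batch method scores each result in turn.
        return [self.score_query_result(r, query_spec) for r in query_results]


class TopGapScorer(ToyQueryScorer):
    """Toy scorer: gap between the best and second-best hit score."""

    def score_query_result(self, query_result, query_spec) -> float:
        ranked = sorted(query_result.scores, reverse=True)
        if len(ranked) < 2:
            return 0.0  # no runner-up to compare against
        return float(ranked[0] - ranked[1])


scorer = TopGapScorer()
results = [
    ToyQueryResult([9.0, 2.0, 1.0]),
    ToyQueryResult([5.0, 4.5]),
    ToyQueryResult([3.0]),
]
scores = scorer.score_query_results(results, query_spec=None)
print(scores)
```

A custom scorer written against the real QueryScorer base class would presumably be passed through the scorer argument of IndexOptimizer, though the expected result type should be checked against the library source first.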

class sparkly.index_optimizer.query_scorer.RankQueryScorer(threshold, k)¶

Bases: sparkly.index_optimizer.query_scorer.QueryScorer

Methods

score_query_result
score_query_results

score_query_result(query_result, query_spec) → float¶

score_query_results(query_results, query_spec) → list¶

sparkly.index_optimizer.query_scorer.compute_wilcoxon_score(x, y)¶

sparkly.index_optimizer.query_scorer.score_query_result(scores, drop_first=False)¶

sparkly.index_optimizer.query_scorer.score_query_result_sum(scores)¶

sparkly.index_optimizer.query_scorer.score_query_results(query_results)¶

Module contents¶