flaxsearch/solr-es-comparison

Language: XSLT

git: https://github.com/flaxsearch/solr-es-comparison

README.md

SolrCloud/Elasticsearch comparison tools

This repository contains various Python scripts and config files used by Flax for our
performance comparison of Solr and Elasticsearch, presented at the BCS Search Solutions
event in November 2014 - the slides for which are at:

http://www.slideshare.net/charliejuggler/lucene-solrlondonug-meetup28nov2014-solr-es-performance

These files are provided for interest only, and we make no claims about their usefulness
for any other application.

In order to provide a completely "fair" comparison, the exact same document set is used
for both the Solr and Elasticsearch indexes. To avoid the overhead involved in downloading
a large document set, we instead used a Markov chain (and a Python implementation by
Shabda Raaj) to generate random documents of various sizes from a training document.
Our study used data/stoicism.txt (downloaded from gutenberg.org) for training, but any
"normal" text of reasonable size and should be usable for this. One thing that is currently
unclear is how realistic this approach is compared with real documents, but Elasticsearch
and Solr did at least receive the same data. Analysis also showed that the Markov-generated
text (like natural text) obeyed Zipf's Law on word distribution, which supports its
validity.
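
For illustration, here is a minimal sketch of the Markov-chain idea (this is not the repository's generator.py; the function names and the order-2 chain are choices made for the example): build a table mapping each pair of consecutive words to the words that can follow it, then walk the table from a random starting pair.

import random
from collections import defaultdict

def build_chain(text, order=2):
    # map each tuple of `order` consecutive words to its possible successors
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length):
    # walk the chain from a random starting key, restarting at dead ends
    key = random.choice(list(chain))
    out = list(key)
    while len(out) < length:
        successors = chain.get(key)
        if not successors:
            key = random.choice(list(chain))
            continue
        out.append(random.choice(successors))
        key = tuple(out[-len(key):])
    return " ".join(out)

Applied to a training text such as data/stoicism.txt, generate(build_chain(text), 500) produces a 500-word document that preserves the training text's local word statistics.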

Generating random documents

The generate/generator.py script is used to generate random documents for indexing,
which it saves as a gzip file. It takes the following arguments:

-h, --help  show this help message and exit
-n N        number of documents to generate
-o O        output filename
-i I        training text
--min MIN   minimum doc size in words
--max MAX   maximum doc size in words

e.g., to create 1M random documents ranging in size between 10 and 1000 words, based on
data/stoicism.txt:

$ cd generate
$ python generator.py -n 1000000 -i ../data/stoicism.txt -o ../data/docs.gz --min 10 --max 1000

Indexing to Elasticsearch

Before indexing, you need to configure the index, e.g. with curl:

$ cd elasticsearch
$ curl -XPUT http://localhost:9200/speedtest -d@index-config.json

(replacing localhost:9200 with the location of your Elasticsearch instance). Then edit the
indexer.py script and set ES_URL to point to the speedtest index.

$ time python indexer.py ../data/docs.gz A

The second parameter (A in this case) is used as an ID prefix. You can run several indexers
in parallel, using different ID prefixes to prevent ID clashes.
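
As a rough sketch of what such an indexer can look like (the repository's indexer.py may be organised quite differently; the batch size, the text field name, and the assumption of one document per line in the gzip file are all inventions for the example):

import gzip
import json
import requests

ES_URL = "http://localhost:9200/speedtest"  # point this at your speedtest index

def index_docs(path, prefix, batch=1000):
    # stream gzipped documents into the Elasticsearch bulk API
    lines = []
    with gzip.open(path, "rt") as f:
        for i, doc in enumerate(f):
            # the prefix keeps IDs unique across parallel indexer runs
            lines.append(json.dumps({"index": {"_id": "%s%d" % (prefix, i)}}))
            lines.append(json.dumps({"text": doc.strip()}))
            if len(lines) >= 2 * batch:
                requests.post(ES_URL + "/doc/_bulk", data="\n".join(lines) + "\n")
                lines = []
    if lines:
        requests.post(ES_URL + "/doc/_bulk", data="\n".join(lines) + "\n")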

Indexing to Solr

A solr conf directory is provided in solr. You will need to upload this to SolrCloud
using the usual methods (or for single node Solr, copy it over the default config). The
indexer.py script needs to be edited to point SOLR_URL to the correct location. Then,
the indexer is run in the same way as the Elasticsearch indexer.
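
For example, with the zkcli.sh tool shipped with SolrCloud installations of that era, the upload might look like this (the ZooKeeper address, paths, and config name are placeholders):

$ ./zkcli.sh -zkhost localhost:2181 -cmd upconfig \
    -confdir path/to/solr/conf -confname speedtest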

Running the search tests

The main test script is loadtester.py in loadtest. It takes the following arguments:

-h, --help   show help message and exit
--es ES      Elasticsearch search URL
--solr SOLR  Solr search URL
-i I         input file for words
-o O         output file
--ns NS      number of searches (default is 1)
--nt NT      number of terms (default is 1)
--nf NF      number of filters (default is 0)
--fac        use facets

For example:

$ python loadtester.py \
    --solr "http://localhost:8983/solr/collection1/query" \
    -i ../data/stoicism.txt -o test1.txt --ns 100 --nt 3

The output is simply a text file where each line records the number of documents found
and the query time. To get some basic analysis of the results:

$ python analyser.py test1.txt
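
As a rough idea of the kind of summary such a file supports, the following sketch computes a few query-time statistics (it assumes each line holds a document count followed by a query time, as described above; analyser.py itself may report different figures):

import sys

times = []
with open(sys.argv[1]) as f:
    for line in f:
        if not line.strip():
            continue
        count, t = line.split()  # documents found, query time
        times.append(float(t))

times.sort()
print("searches: %d" % len(times))
print("mean:     %.3f" % (sum(times) / len(times)))
print("median:   %.3f" % times[len(times) // 2])
print("95th pct: %.3f" % times[int(len(times) * 0.95)])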

The merge2.py and merge3.py scripts can be used to merge the query times of two or
three results files and write them as a .csv file for importing into a spreadsheet etc.

The qps.py script runs searches repeatedly and prints the QPS to stdout. Multiple
instances can be run concurrently to increase the load (there is no multithreading,
currently).
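
A minimal sketch of the same idea, issuing a fixed query in a loop and printing the achieved rate (the URL and query term are placeholders, not the script's actual defaults):

import time
import requests

URL = "http://localhost:8983/solr/collection1/query"  # placeholder search URL
BATCH = 100

while True:
    start = time.time()
    for _ in range(BATCH):
        requests.get(URL, params={"q": "stoic"})
    print("%.1f qps" % (BATCH / (time.time() - start)))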