AWS CloudSearch is the hosted, as-a-service version of Apache Solr, which is built on Apache Lucene.
Overall
CloudSearch offers many types of search, including full text, Boolean, prefix, and range. In addition, you can use term boosting, faceting, and highlighting, and enable autocomplete as well. Common file types (HTML, PDF, Microsoft Office documents) can be searched, as can DynamoDB tables.
What is a bit different from other search offerings is the data load process. Instead of the system indexing data in a series of paths, data is uploaded to a search domain location defined by CloudSearch and then indexed.
Integration with IAM: you control access to the Amazon CloudSearch configuration service APIs and the domain service APIs (which control use of the domain itself) independently.
Scaling is automatic based on data size and search traffic, but you can scale out manually as well. Multi-AZ is also available.
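To make the feature list concrete, here is a minimal query sketch against a CloudSearch domain using boto3; the endpoint URL and the `title`/`genre` fields are hypothetical placeholders, not values from these notes.

```python
import boto3

# The cloudsearchdomain client must be pointed at the domain's search endpoint
# (placeholder URL shown here).
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

response = client.search(
    query="star wars",                               # full-text query
    queryParser="simple",                            # simple | structured | lucene | dismax
    facet='{"genre": {"sort": "count", "size": 5}}', # facet counts per genre
    highlight='{"title": {}}',                       # highlight matches in the title field
    size=10,
)

for hit in response["hits"]["hit"]:
    print(hit["id"], hit.get("fields"))
```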
Setup
- Create a domain (a minimal setup sketch follows this list).
- Enable access using an access policy (private by default).
- Set the instance type, desired replica count, and partition count (partition count only for `search.m3.2xlarge` and larger).
- Upload content in batches of less than 5 MB.
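As a rough sketch of the first two steps with boto3's configuration-service client, assuming a hypothetical domain named `movies`; the policy shown simply allows public search and suggest requests.

```python
import json
import boto3

cs = boto3.client("cloudsearch", region_name="us-east-1")

# Create the domain; it is private until an access policy is attached.
cs.create_domain(DomainName="movies")

# Allow anyone to issue search/suggest requests against the domain's endpoints;
# document uploads still require credentials from the owning account.
access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": ["cloudsearch:search", "cloudsearch:suggest"],
        }
    ],
}
cs.update_service_access_policies(
    DomainName="movies",
    AccessPolicies=json.dumps(access_policy),
)
```

Instance type, replica count, and partition count are set with `update-scaling-parameters`, sketched under Scaling below.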
Setup Indexing
Use `aws cloudsearch define-index-field` to manually set up index fields, or `cs-configure-from-batches` to automatically set up index fields from your document batches.
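A minimal sketch of the SDK equivalent of `aws cloudsearch define-index-field`; the domain and field names are assumptions for illustration.

```python
import boto3

cs = boto3.client("cloudsearch", region_name="us-east-1")

# A full-text field that is returned in results and can be highlighted.
cs.define_index_field(
    DomainName="movies",
    IndexField={
        "IndexFieldName": "title",
        "IndexFieldType": "text",
        "TextOptions": {"ReturnEnabled": True, "HighlightEnabled": True},
    },
)

# A literal field with faceting enabled.
cs.define_index_field(
    DomainName="movies",
    IndexField={
        "IndexFieldName": "genre",
        "IndexFieldType": "literal",
        "LiteralOptions": {"FacetEnabled": True, "ReturnEnabled": True},
    },
)

# Rebuild the index so the new fields take effect.
cs.index_documents(DomainName="movies")
```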
Content Update
The model is to upload data into CloudSearch, so if data changes it must be resubmitted to CloudSearch. A document batch is a collection of add and delete operations that represent the documents you want to add, update, or delete from your domain. Batches can be described in either JSON or XML. Maximize batch size to get the best update performance; the maximum size for a batch is 5 MB.
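A minimal sketch of the batch-push model: a JSON document batch with add and delete operations pushed to a (placeholder) document endpoint, with a size check reflecting the 5 MB limit above.

```python
import json
import boto3

# Two example operations: add/update one document, delete another.
batch = [
    {"type": "add", "id": "tt0076759",
     "fields": {"title": "Star Wars", "genre": "sci-fi"}},
    {"type": "delete", "id": "tt0000000"},
]

# The cloudsearchdomain client must target the domain's document endpoint.
docs = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-movies-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

payload = json.dumps(batch).encode("utf-8")
assert len(payload) < 5 * 1024 * 1024, "batch exceeds the 5 MB limit"

docs.upload_documents(documents=payload, contentType="application/json")
```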
Use
The Amazon CloudSearch configuration service APIs and the domain service APIs are packaged independently.
As usual, when accessing via the CLI or an SDK, requests are signed, which saves back-and-forth authentication traffic.
Filtering is efficient and does not contribute to ranking.
There is no easy way to delete all the documents in a domain, and a domain must be re-indexed to scale down.
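As a sketch of the filtering point above: `filterQuery` narrows results without contributing to relevance scores, while `query` remains the scored full-text match. The endpoint and fields are placeholders.

```python
import boto3

search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-movies-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

response = search.search(
    query="star",                                     # scored full-text match
    queryParser="simple",
    filterQuery="(and genre:'sci-fi' year:[2000,})",  # unscored structured filter
    size=10,
)
```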
Scaling
Clusters start up on `search.m3.small` instances and scale up to handle increased load, speed up requests, accommodate increased data size, and improve fault tolerance. Multi-AZ increases fault tolerance but doubles the cost.
Use manual scaling for data-load and query spikes, and realize that the values set with `update-scaling-parameters` become the new baseline.
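A sketch of manual scaling with boto3's `update_scaling_parameters`, using placeholder values; whatever is set here becomes the domain's baseline.

```python
import boto3

cs = boto3.client("cloudsearch", region_name="us-east-1")

cs.update_scaling_parameters(
    DomainName="movies",
    ScalingParameters={
        "DesiredInstanceType": "search.m3.2xlarge",  # scale up for heavier load
        "DesiredReplicationCount": 2,                # replicas for query throughput / fault tolerance
        "DesiredPartitionCount": 2,                  # partitions for larger data (2xlarge instances)
    },
)
```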
Elasticsearch vs CloudSearch
| Feature | Elasticsearch | CloudSearch |
|---|---|---|
| Underlying | Lucene/Elasticsearch | Solr |
| Interface | HTTP endpoint | SDK/CLI |
| HA | Single AZ | Multi-AZ |
| Update | incremental pull | batch push |
Troubleshooting
- `504` or `507` errors will occur if batches are submitted at too high a rate or are too large; use the CLI for batches bigger than 5 MB (a retry sketch follows this list).
- `507` errors can also indicate a general service overload condition; scale out manually.
- `409` errors generally indicate service resource limits; contact AWS.
- Reduce hit size by querying only after 2 characters have been entered in the UI, and use a stopwords list.
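As a rough illustration of handling `504`/`507` responses, a minimal retry-with-backoff sketch; `docs_client` and `payload` are assumed to be the document-endpoint client and batch payload from the earlier upload sketch.

```python
import time
from botocore.exceptions import ClientError

def upload_with_backoff(docs_client, payload, attempts=5):
    """Retry document uploads on 504/507 with exponential backoff."""
    for attempt in range(attempts):
        try:
            return docs_client.upload_documents(
                documents=payload, contentType="application/json"
            )
        except ClientError as err:
            status = err.response["ResponseMetadata"]["HTTPStatusCode"]
            if status in (504, 507) and attempt < attempts - 1:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
                continue
            raise                         # other errors (e.g. 409) are not retried here
```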