SolrCloud (aka Apache Solr)

Apache SolrSolrCloud is the name of a set of new distributed capabilities in Solr.  New Solr provides the highly scalable, fault tolerant, distributed indexing & search capabilities, near real-time search, centralized cluster configuration & management.

Distributed Search – Terminology

Cluster: Set of Solr nodes managed as a single unit

Node: A JVM instance running Solr

Partition: Subset of the entire document collection, It just a single solr core

Shard: A Partition needs to be stored in multiple nodes as specified by the replication factor. All these nodes collectively form a shard. A node may be a part of multiple shards

Leader: Each Shard has one node identified as its leader. All the writes for documents belonging to a partition routed through the leader

Collection: One or more Shards/Replicas are arranged into Collection

Replication Factor: Minimum number of copies of a document maintained by the cluster

Transaction Log: An append-only log of write operations maintained by each node


Bird Eye: SolrCloud vs Solr

  • SolrCloud or Classic Solr determined by zkHost parameter on startup
  • Zookeeper, stores coordination and configuration information for the cluster
  • The Solr configuration files like schema.xml and solrconfig.xml are stored in ZooKeeper, instead of File System
  • SolrCloud eliminates the master-slave specifics, and automates both update and search seamlessly
  • Narrowed solr configuration, just solr.xml requires in the solr cores. The other solr configuration’s is read from Zookeeper
  • Collection API’s and CoreAdmin API’s for manage & create one
  • Update log for durability and recovery, the special _version_ field to help with some of the these points, the coordination (election of overseer & shard leaders)
  • Replication factor comes into play – minimum no. of copies of a document maintained by the cluster.
  • Near Realtime search support through soft commits and handler ‘/get‘ return latest stored field values.  Note: relies on the updateLog feature
  • Distributed search across collections as long as they are compatible (for example: across geo’s search – multiple collections)
  • Queries sent to any node automatically perform a full distributed search across the cluster with load balancing and fail-over (also we can make of shards.tolerant=true to return just the documents that are available in the shards that are still alive)
  • Updates sent to any node in the cluster and are automatically forwarded to the correct shard and replicated to multiple nodes for redundancy.  Typically suggested to send to Leader directly, for that make use of ‘Smart SolrJ client (CloudSolrServer) that knows to send documents only to the shard leaders’
  • Improved Admin Interface for management and error reporting (its for both classic Solr & SolrCloud)
  • Brings new terminology’s in to Solr world, as mentioned in the terminology section
  • New configurable ScriptUpdateProcessor for DIH – manipulate documents on-the-fly
  • Brings new admin UI customization using admin-extra.menu-top.html and admin-extra.menu-bottom.html
  • New _version_ field (updateLog depend on this field) for Near Real-time get/partial documents update
  • A new spellchecker implementation was introduced solr.DirectSolrSpellchecker; it allows you to use main index to provide spelling suggestions and didn’t need to be rebuilt after every commit


SolrCloud Article’s

Zookeeper Cluster (Multi-Server) Setup of-course Solr comes with embedded ZooKeeper, however not recommended for Production use.

Zookeeper Cluster (Multi-Server) - Deployment DiagramScreenshot: zk-server-1 directory structure along with conf/zoo.cfg

SolrCloud Cluster (Single Collection) Deployment – here we will deploy SolrCloud Cluster in medium complexity of 3 Shard(s) with replica(s) [replication factor 3], 3 Instances of Tomcat Servers for Solr Node(s) and 5 Replicated servers in ZooKeeper ensemble.

Upgrade/Migration of Solr 3.x to Solr 4 – here we will get into steps involved in migrating Classic Solr 3.x into brand new Solr 4 aka SolrCloud.

SolrCloud Cluster (Multiple Collection) Deployment – here we will go further deep into SolrCloud complex deployment of 2 Collections (Data Center 1 & 2), 3 shard(s) with replica(s)[replication factor 3], 6 Instances of Tomcat Servers and 9 Replicated servers in Zookeeper ensemble.


Notable Improvements in Versions

  • XML configuration parsing is now more strict about situations where a single setting is allowed but multiple values are found. In the past, one value would be chosen arbitrarily and silently. Starting with 4.5, configuration parsing will fail with an error in situations like this. If you see error messages such as solrconfig.xml contains more than one value for config path: XXXXX or Found Z configuration sections when at most 1 is allowed matching expression: XXXXX check your solrconfig.xml file for multiple occurrences of XXXXX and delete the ones that you do not wish to use.
  • Allow multiple threads to be specified for faceting. When threading, one can specify facet.threads to parallelize loading the uninverted fields. In at least one extreme case this reduced warmup time from 20 seconds to 3 seconds.
  • In the past, schema.xml parsing would silently ignore default or required options specified on <dynamicField/> declarations. Begining with 4.5, attempting to do configured these on a dynamic field will cause an init error. If you encounter one of these errors when upgrading an existing schema.xml, you can safely remove these attributes, regardless of their value, from your config and Solr will continue to bahave exactly as it did in previous versions.
  • The UniqFieldsUpdateProcessorFactory has been improved to support all of the FieldMutatingUpdateProcessorFactory selector options. The <lst named="fields"> init param option is now deprecated and should be replaced with the more standard <arr name="fieldName">.
  • The routing parameter shard.keys is deprecated as part of SOLR-5017. The new parameter name is _route_. The old parameter should continue to work for another release.
  • UpdateRequestExt has been removed as part of SOLR-4816. So use UpdateRequest instead.
  • CloudSolrServer can now use multiple threads to add documents by default. This is a small change in runtime semantics when using the bulk add method - you will still end up with the same exception on a failure, but some documents beyond the one that failed may have made it in. To get the old, single threaded behavior, set parallel updates to false on the CloudSolrServer instance.
  • CloudSolrServer can now route updates locally and no longer relies on inter-node update forwarding.

  • A Schemaless Solr arrived. An example config set for schemaless mode is here solr/example/example-schemaless/
  • Dynamically add fields to schema. Extended FieldMutatingUpdateProcessor.ConfigurableFieldNameSelector to enable checking whether a field matches any schema field. To select field names that don't match any fields or dynamic fields in the schema, add <bool name="fieldNameMatchesSchemaField">false</bool> to an update processor's configuration in solrconfig.xml.
  • SolrJ's SolrPing object has new methods for ping, enable, and disable.
  • Fix applied, handling of <mergePolicy> init arg useCompoundFile needed after changes in LUCENE-5038
  • Properties files by Solr are now written in UTF-8 encoding, Unicode is no longer escaped. Reading of legacy properties files with u escapes is still possible.

  • Provided REST API read access to all elements of the live schema. Added a REST API request to return the entire live schema, in JSON, XML, and schema.xml formats. Move REST API methods from package org.apache.solr.rest to org.apache.solr.rest.schema, and rename base functionality REST API classes to remove the current schema focus, to prepare for other non-schema REST APIs. Change output path for copyFields and dynamicFields from "copyfields" and "dynamicfields" (all lowercase) to copyFields and dynamicFields, respectively, to align with all other REST API outputs, which use camelCase.
  • In preparation for REST API requests that can modify the schema, a managed schema is introduced. Added <schemaFactory mutable="true"/> to solrconfig.xml in order to use it, and to enable schema modifications via REST API requests.
  • A new collections api to add additional shards dynamically by splitting existing shards.
  • Discovering SolrCores by directory structure rather than defining them in solr.xml.  Also, change the format of solr.xml to be closer to that of solrconfig.xml. This version of Solr will ship the example in the old style, but you can manually try the new style. Solr 4.4 will ship with the new style, and Solr 5.0 will remove support for the old style.
  • HttpSolrServer sends the stream name and exposes useMultiPartPost
  • Fixed group.facet=true to work with negative facet.limit
  • The hardcoded SolrCloud defaults for hostContext="solr" and hostPort="8983" have been deprecated and will be removed in Solr 5.0. Existing solr.xml files that do not have these options explicitly specified should be updated accordingly.

  • Added REST API methods, via Restlet integration, for reading schema elements, at /schema/fields/, /schema/dynamicfields/, /schema/fieldtypes/, and /schema/copyfields/
  • New SweetSpotSimilarityFactory allows customizable TF/IDF based Similarity when you know the optimal Sweet Spot of values for the field length and TF scoring factors.
  • SolrJ, and SolrCloud internals, now use SystemDefaultHttpClient under the covers -- allowing many HTTP connection related properties to be controlled via 'standard' java system properties.
  • CurrencyField's OpenExchangeRatesOrgProvider now requires a ratesFileLocation init param, since the previous global default no longer works.
  • Sort directions (asc, desc) are now case insensitive
  • Collection Aliasing. the search side will allowing mapping a single alias to multiple collections, but the index side will only support mapping a single alias to a single collection.

  • ExternalFileField caches can be reloaded on firstSearcher/ newSearcher events using the ExternalFileFieldReloader - implementation.
  • Collection specific document routing. The "compositeId" router is the default for collections with hash based routing (i.e. when numShards=N is specified on collection creation). Documents with ids sharing the same domain (prefix) will be routed to the same shard, allowing for efficient querying.
  • Added <propertyWriter /> element to DIH's data-config.xml file, allowing the user to specify the location, filename and Locale for the data-config.properties file. Alternatively, users can specify their own property writer implementation for greater control. This new configuration element is optional, and defaults mimic prior behavior. The one exception is that the root locale is default. Previously it was the machine's default locale.
  • Fix SolrCloud behavior when using hostContext containing _ or / characters. This fix also makes SolrCloud more accepting of hostContext values with leading/trailing slashes.

I will see you around, enjoy the article. Have queries? hmm, post it here!