Monday, December 24, 2018

Searching JSON: comparing text search in Couchbase and MongoDB.



“Apps without search is like Google homepage without the search bar.”
"Apps without search is like Google homepage without the search bar."
Image title
It's hard to design an app without a good search. These days, it's also hard to find a database without a built-in search. MySQL to NoSQL, Sybase to Couchbase, every database has text search support — built-in like Couchbase or via integration to Elastic — as is the case in Cassandra. Unlike SQL, text search functionality isn't standardized. Every application needs best of the breed search, but not every database provides the same text search functionality. It's important to understand the available feature, performance of each text search implementation and choose what fits your application need. After motivating text search, you'll learn about the text search features you'd need for an effective, compare and contrast those features in MongoDB and Couchbase with examples.
Let's look at the application level search requirements.
Exact Search: (WHERE item_id = "ABC482")
Range Search: (WHERE item_type = "shoes" and size = 6 and price between 49.99 and 99.99)
String search:
  • (WHERE lower(name) LIKE "%joe%")
  • (WHERE lower(name) LIKE "%joe%" AND state = "CA")
  • Find joe in any field within the JSON document
  • Find documents matching phone number (408-956-2444) in any format (+1 (408) 956-2444, +1 510.956.2444, (408) 956 2444)
Document search:
Complex search: (WHERE lower(title) LIKE "%dictator%" and lower(actor) LIKE "%chaplin" and year < 1950)
Range searches in cases (1) and (2) can be handled with typical B-Tree indexes efficiently. The data is well organized by the full data you're searching for. When you start to look for the word fragment "joe" or match phone numbers with various patterns in a larger document, B-Tree based indexes suffer. Simple tokenizations and using B-Tree based indexes can help in simple cases. You need new approaches to your real-world search cases.
The appendix section of this blog has more details on how the inverted tree indexes are organized and why they're used for the enterprise search in Lucene and Bleve. Bleve powers the Couchbse full-text search. MongoDB uses B-Tree based indexes even for text search.
Let's now focus on the text search support in MongoDB and Couchbase.
MongoDB Text Search Overview: Create and query text search index on strings of MongoDB documents. The index seems to be simple B-tree indexes with additional layers for the built-in analyzer. This comes with a lot of sizing and performance issues we'll discuss further. The text search index is tightly integrated into the MongoDB database infrastructure and its query API.
MongoDB provides text indexes to support text search queries only on strings. Its text indexes can include only fields whose value is a string or an array of string elements. A collection can only have one text search index, but that index can cover multiple fields.
Couchbase FTS (Full-Text Search) Overview: Full-Text Search provides extensive capabilities for natural-language querying. Bleve, implemented as an inverted index, powers the Couchbase full-text index. The index is deployed as one of the services and can be deployed on any of the nodes in the cluster.
  1. Category: Name
    • MongoDB 4.x: Text search
    • Couchbase 6.x: Full Text search
  2. Category: Functionality
    • MongoDB: Simple text search to index string fields and search for a string in one or more string fields only. Uses its B-Tree indexes for the text search index.
      Search on the whole composite string and cannot separate the specific fields.
    • Couchbase: Full-text search to find anything in your data. Supports all JSON data types (string, numeric, boolean, date/time); query supports complex boolean expressions, fuzzy expressions on any type of fields. Uses the inverted index for the text search index.
  3. Category: Installation
    • MongoDB: Text search: Available with MongoDB installation. No separate installation option.
    • Couchbase: Available with Couchbase installation. Can be installed with other services (data, query, index, etc) or installed separately on distinct search nodes.
  4. Category: Datatypes Supported
  5. Category: Index creation on a single field
    • MongoDB: db.films.createIndex({ title: “text” });
    • Couchbase:
     curl -u Administrator:password -XPUT 
    http://localhost:8094/api/index/films_title -H ‘cache-control: no-cache’ 
           -H ‘content-type: application/json’ 
           -d { “name”: “films_title”, “type”: “fulltext-index”, “params”: 
                { “mapping”: { “default_field”: “title” } }, 
                “sourceType”: “couchbase”, “sourceName”: “films” }
  6. Category: Index creation on multiple fields
    • MongoDB: db.films.createIndex({ title: “text”, genres: “text”});
      Before you can create this index, you’ve to drop the previous index. There can be only one text index on a collection. You need its name, which you get by: db.films.getIndexes() or specify the name while creating the index.
      db.films.dropIndex(“title_text”);
    • Couchbase: You can create as multiple indexes on a bucket (or keyspace) without restriction.
    curl -u Administrator:password 
    -XPUT http://localhost:8094/api/index/films_title_genres 
        -H ‘cache-control: no-cache’ 
        -H ‘content-type: application/json’       
        -d { “name”: “films_title_genres”, “type”: “fulltext-index”, “params”: {    “mapping”: { “types”: { “genres”: { “enabled”: true, “dynamic”: false }, “title”: { “enabled”: true, “dynamic”: false }}}}, “sourceType”: “couchbase”, “sourceName”: “films” }
  7. Category: Boosting (or using Weights)
    • MongoDB: db.films.createIndex({ title: “text”, genres: “text”}, {weights:{title: 25}, name : “txt_title_genres”});
    • Couchbase: Done dynamically via boosting using the ^ mofidier.
      curl -XPOST 
          -H “Content-Type: application/json” \ http://172.23.120.38:8094/api/index/films_title_genres/query \ 
           -d { “explain”: true, “fields”: [ * ], “highlight”: {}, “query”: { “query”: “title:charlie^40 genres:comedy^5 } }
  8. Category: Language Option
    • MongoDB: Default language is English. Pass in a parameter to change that.
      db.films.createIndex({ title: “text”}, { default_language: “french” });
    • Couchbase: Analyzers are available in 24 languages. You can change is while creating the index by changing the following parameter.
      “default_analyzer”: “fr”
  9. Category: Case Insensitive Text Index
    • MongoDB: Case insensitive by default. Extended to new languages.
    • Couchbase: Case insensitive by default.
  10. Category: diacritic insensitive
    • MongoDB: With version 3, the text index is diacritic insensitive.
    • Couchbase: Yes. Automatically enabled in the appropriate analyzer (e.g. French)
  11. Category: Delimiters
    • MongoDB: Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space
    • Couchbase: Each work is analyzed based on the language and analyzer specification.
  12. Category: Languages
    • MongoDB: 15 languages:
      Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
    • Couchbase: Token filters are supported for the following languages.
      Arabic, Catalan, Chinese , Japanese , Korean, Kurdish, Danish, German, Greek, English, Spanish (Castilian), Basque, Persian, Finnish, French, Gaelic, Spanish (Galician), Hindi, Hungarian, Armenian, Indonesian, Italian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Turkish
  13. Category: Type of Index
    • MongoDB: Simple B-Tree index containing on entry for each stemmed word in each document.
      text indexes can be large. They contain one index entry for each unique post-stemmed word in each indexed field for each document inserted.
    • Couchbase: Inverted index. One entry per stemmed word in the WHOLE index (per index partition). So, the index sizes are significantly smaller index. The more humongous the data set is, Couchbase FTS index is that much more efficient compared to MongoDB text index.
  14. Category: Index creation effect on INSERTS.
    • MongoDB: Will negatively affect the INSERT rate.
    • Couchbase: INSERT/UPSERT rates will remain unaffected
  15. Category: Index Maintenance
    • MongoDB: Synchronously Maintained.
    • Couchbase: Asynchronously maintained. Queries can specify the staleness using the consistency parameter.
  16. Category: Phrase Queries
    • MongoDB: Supported, but slow.
      Phrase searches slow since the text index does not include the required metadata about the proximity of words in the documents. As a result, phrase queries will run much more effectively when the entire collection fits in RAM.
    • Couchbase: Supported and fast.
      Include the term vectors during index creation.
  17. Category: Text Search
    • MongoDB: db.films.find({$text: {$search: “charlie chaplin”}})
      This find all the documents that contain charlie OR chaplin. Having both charlie and chaplin will get higher score. Since there can be only ONE text index per collection, this query uses that index irrespective of the field it indexes. So, it’s important to decide which of the fields should be in the index.
    • Couchbase:
    Very Flexible text search. 
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "query": {
        "query": "charlie chaplin"
      }
    }'
  18. Category: Exact phrase search
    • MongoDB: db.films.find({$text: {$search: “\”charlie chaplin\””}})
    • Couchbase:
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "query": {
        "query": "\"charlie chaplin\""
      }
    }'
  19. Category: Exact Exclusion
    • MongoDB: db.films.find({$text: {$search: “charlie -chaplin”}});
      All the movie with “charlie”, but without “chaplin”.
    • Couchbase:
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "query": {
        "query": "charlie -chaplin"
      }
    }'
  20. Category: Results Ordering
    • MongoDB: Unordered by default.
      Project and sort by score when you need it.
      db.films.find({$text: {$search: “charlie chaplin”}}, {score: {$meta: “searchscore”}}).sort({$meta: “searchscore”})
    • Couchbase
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "sort":["title", "-_score"]
      "query": {
        "query": "charlie -chaplin"
      }
    }'
  21. Category: Specific Language Search
    • MongoDB: db.articles.find(
      { $text: { $search: “leche”, $language: “es” } }
      )
    • Couchbase: The language analyzer will have determined the characteristics of the index and query.
  22. Category: Case Insensitive Search
  23. Category: LIMIT and OFFSETS(Pagination)
    • MongoDB: db.films.find({$text: {$search: “charlie chaplin”}},{score: {$meta: “searchscore”}}).sort({$meta: “searchscore”}).limit(10)
    • Couchbase: Supports the equivalant of LIMIT and SKIP in SQL using “size” and “from” parameters respectively.
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "explain": true,
      "fields": [
        "*"
      ],
      "highlight": {},
      "query": {
        "query": "charlie chaplin"
      }, 
      "size":10,
      "from":40
    }'
  24. Category: Complex Sorting
    • MongoDB: db.films.find({$text: {$search: “charlie chaplin”}},
      {score: {$meta: “searchscore”}}).sort({year : 1, $meta: “searchscore”}).limit(10)
    • Couchbase: Ordered by score (descending) by default. Can order by any field or meta data. This sorts by title (ascending), year (descending) and score (descending)
    curl -XPOST -H "Content-Type: application/json" \
    http://172.23.120.38:8094/api/index/films_title_genres/query \
    -d '{
      "sort":["title", "-year", "-_score"]
      "query": {
        "query": "charlie -chaplin"
      }
    }'
  25. Category: Complex Query
    • MongoDB: Use the aggregation framework. $text search can be used in an aggregation framework with some restrictions.
      • db.articles.aggregate(
        [
        { $match: { $text: { $search: “charlie chaplin” } } },
        { $project: { title: 1, _id: 0, score: { $meta: “searchscore” } } },
        { $match: { score: { $gt: 5.0 } } }
        ]
        )
      • Limitations: https://docs.mongodb.com/manual/tutorial/text-search-in-aggregation/
    • Couchbase: As you’ve seen so far, FTS query itself is pretty sophisticated. In addition, FTS supports facets for simple grouping and counting. https://docs.couchbase.com/server/6.0/fts/fts-response-object-schema.html
      In the upcoming release, N1QL (SQL for JSON) will use the FTS index for search predicates.
    SELECT state, sum(sales)
    FROM store_sales s
    WHERE search(s.title, "lego", "fts_title")
    GROUP BY state
  26. Category: Full Document Index
    • MongoDB: Does not support full document indexing. All the string fields will have to be specified in the createIndex call.
      db.films.createIndex({ title: “text”, generes: “text”, cast: “text”, year: “text”});
    • Couchbase: By default, it supports indexing the full document, automatically recognizes the type of the field and indexes them accordingly.
  27. Category: Query Types
    • MongoDB: Basic search, must have, must not have.
    • Couchbase: Match, Match Phrase, Doc ID, and Prefix queries
      Conjunction, Disjunction, and Boolean field queries
      Numeric Range and Date Range queries
      • Geospatial queries
        Query String queries, which employ a special syntax to express the details of each query (see Query String Query for information)
  28. Category: Analyzers
  29. Category: Create and search via UI
    • MongoDB: Not in the base product.
    • Couchbase: Built into Console
  30. Category: REST API
    • MongoDB: Unavailable
  31. Category: SDK
  32. Category: Term Vectors
    • MongoDB: Unsupported
    • Couchbase: Available. Term vectors are very useful in a phrase search.
  33. Category: Faceting
  34. Category: Advanced AND queries (conjuncts) using the text search
    • MongoDB: Unsupported
    • Couchbase:
    curl -u Administrator:password -XPOST 
    -H “Content-Type: application/json” http://172.23.120.38:8094/api/index/filmsearch/query -d ‘{
     “query”: {
       “conjuncts”:[ { “field”:”title”, “match”:”kid”}, 
                     {“field”:”cast”, “match”:”chaplin”}]
     }
    }
  35. Category: Advanced OR queries (disjuncts)
    • MongoDB: Unsupported
    • Couchbase:
    curl -u Administrator:password -XPOST 
    -H “Content-Type: application/json” http://172.23.120.38:8094/api/index/filmsearch/query -d ‘{
     “query”: {
       “disjuncts”:[ { “field”:”title”, “match”:”kid”}, 
                    {“field”:”cast”, “match”:”chaplin”}]
     }
    }
  36. Category: Date range queries
    • MongoDB: Unsupported.
      Needs post-processing, which could affect the performance.
    • Couchbase:
    Supported
    {
    “start”: “2001-10-09T10:20:30-08:00,
    “end”: “2016-10-31,
    “inclusive_start”: false,
    “inclusive_end”: false,
    “field”: “review_date”
    }
  37. Category: Numerical Range Queries
    • MongoDB: Unsupported
    • Couchbase:
     curl -u Administrator:password -XPOST -H “Content-Type: application/json” http://172.23.120.38:8094/api/index/filmsearch/query -d ‘{
    “explain”: true,
    “query”: {
      “field”:”year”, “min”:1999, “max”:1999, 
      “inclusive_min”: true, “inclusive_max”:true
    }
    }

Performance

While an elaborate performance comparison is still pending, we did a quick comparison with 1 million documents from wikipedia. Here’s what we saw:
Index Sizes.

MongoDB (4.x)
Couchbase (6.0)
Indexing Size
  1. 1.6 GB
1 GB (scorch)
Indexing Time
450 seconds
46 seconds
Search Query Throughput (queries per second):
MongoDB
Couchbase
High Frequency
Terms
79
395
Medium Frequency Terms
201
  1. 6396
Low Frequency
Terms
643
  1. 24600
High OR High terms
82
145
High OR Medium terms
78
258
Phrase search
50
  1. 107

Summary

  1. MongoDB provides simple string-search index and APIs to do string search. The B-tree index it creates for string search also be quite huge. Text search, it is not.
  2. Couchbase text index is based on inverted index and is a full text index with a significantly more of features and better performance.
Why Inverted Index for search index?
Simple exact and range searches can be powered by B-Tree like indexes for an efficient scan. Text searches, however, have wider requirement of stemming, stopwords, analyzers, etc. This requires not only a different indexing approach but also pre-index filtering, custom analysis tools, language specific stemming and case insensitivities.
Search index can be created using traditional B-TREE. But, unlike a B-tree index on scalar values, text index will have multiple index entries for each document. A text index on this document alone could have up to 12 entries: 8 for cast names, one for genres, two for the title after removing the stop word (in) and the year. Larger documents and document counts will increase the size of the text index exponentially.
 {
      "cast": [
        "Whoopi Goldberg",
        "Ted Danson",
        "Will Smith",
        "Nia Long"
      ],
      "genres": [
        "Comedy"
      ],
      "title": "Made in America",
      "year": 1993
    }
  }
Solution: Here comes the inverted tree. The inverted tree has the data (search term) at the top (root) and has various document keys in which the term exists at the bottom, making the structure look like an inverted tree. Popular text indexes in LuceneBleve are all implemented as inverted indexes.

Image result for inverted tree

Searching JSON: comparing text search in Couchbase and MongoDB.

“Apps without search is like Google homepage without the search bar.” "Apps without search is like Google homepage withou...