Solr: a custom Search RequestHandler

As you know, I've been playing with Solr lately, trying to see how feasible it would be to customize it for our needs. We have been a Lucene shop for a while, and we've built our own search framework around it, which has served us well so far. The move to Solr is driven primarily by the need to expose our search tier as a service to our internal applications. While it would have been relatively simple (probably simpler) to slap an HTTP interface onto our current search tier, we also want to use other Solr features such as incremental indexing and replication.

One of our challenges in adopting Solr is that the way we do search is quite different from the way Solr does it. A query string passed to the default Solr search handler is parsed into a Lucene query, and a single search call is made on the underlying index. In our case, the query string is first passed to our taxonomy, and depending on the type of query (as identified by the taxonomy), it is sent through one or more sub-handlers. Each sub-handler converts the query into a (different) Lucene query and executes the search against the underlying index. The results from the sub-handlers are then layered together to produce the final search result.

Conceptually, the customization is quite simple: create a custom subclass of RequestHandlerBase (as advised on this wiki page) and override its handleRequestBody(SolrQueryRequest, SolrQueryResponse) method. In reality, I had quite a tough time doing this, admittedly caused (at least partly) by my ignorance of Solr internals. I did succeed in the end, so in this post I outline my solution, along with some advice I think will be useful to others embarking on a similar route.

Configuration and Code

The handler is configured to trigger in response to a /solr/mysearch request. Here is the (rewritten for readability) XML snippet from my solrconfig.xml file. I used the "invariants" block to pass configuration parameters to the handler.

  ...
  <requestHandler name="/mysearch" 
      class="org.apache.solr.handler.ext.MyRequestHAndler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score</str>
      <str name="wt">xml</str>
    </lst>
    <lst name="invariants">
      <str name="prop1">value1</str>
      <int name="prop2">value2</int>
      <!-- ... more config items here ... -->
    </lst>
  </requestHandler>
  ...

And here is the (also rewritten for readability) code for the custom handler. I used the SearchHandler and MoreLikeThisHandler as my templates, but diverged from them in several ways to accommodate my requirements, as described below.

package org.apache.solr.handler.ext;

// imports omitted

public class MyRequestHandler extends RequestHandlerBase {

  private String prop1;
  private int prop2;
  ...
  private TaxoService taxoService;

  @Override
  public void init(NamedList args) {
    super.init(args);
    this.prop1 = invariants.get("prop1");
    this.prop2 = Integer.valueOf(invariants.get("prop2"));
    ...
    this.taxoService = new TaxoService(prop1);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {

    // extract params from request
    SolrParams params = req.getParams();
    String q = params.get(CommonParams.Q);
    String[] fqs = params.getParams(CommonParams.FQ);
    int start = 0;
    try { start = Integer.parseInt(params.get(CommonParams.START)); } 
    catch (Exception e) { /* default */ }
    int rows = 0;
    try { rows = Integer.parseInt(params.get(CommonParams.ROWS)); } 
    catch (Exception e) { /* default */ }
    SolrPluginUtils.setReturnFields(req, rsp);

    // build initial data structures
    TaxoResult taxoResult = taxoService.getResult(q);
    SolrDocumentList results = new SolrDocumentList();
    SolrIndexSearcher searcher = req.getSearcher();
    Map<String,SchemaField> fields = req.getSchema().getFields();
    int ndocs = start + rows;
    Filter filter = buildFilter(fqs, req);
    Set<Integer> alreadyFound = new HashSet<Integer>();

    // invoke the various sub-handlers in turn and return results
    doSearch1(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    doSearch2(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    // ... more sub-handler calls here ...

    // build and write response
    float maxScore = 0.0F;
    int numFound = 0;
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      Float score = (Float) sdoc.getFieldValue("score");
      if (score != null && maxScore < score) {
        maxScore = score;
      }
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    results.clear();
    results.addAll(slice);
    results.setNumFound(numFound);
    results.setMaxScore(maxScore);
    results.setStart(start);
    rsp.add("response", results);

  }

  private Filter buildFilter(String[] fqs, SolrQueryRequest req) 
      throws IOException, ParseException {
    if (fqs != null && fqs.length > 0) {
      BooleanQuery fquery = new BooleanQuery();
      for (int i = 0; i < fqs.length; i++) {
        QParser parser = QParser.getParser(fqs[i], null, req);
        fquery.add(parser.getQuery(), Occur.MUST);
      }
      return new CachingWrapperFilter(new QueryWrapperFilter(fquery));
    }
    return null;
  }

  private void doSearch1(SolrDocumentList results,
      SolrIndexSearcher searcher, String q, Filter filter, 
      TaxoResult taxoResult, int ndocs, SolrQueryRequest req,
      Map<String,SchemaField> fields, Set<Integer> alreadyFound) 
      throws IOException {
    // check entry condition
    if (! canEnterSearch1(q, filter, taxoResult)) {
      return;
    }
    // build custom query and extra fields
    Query query = buildCustomQuery1(q, taxoResult);
    Map<String,Object> extraFields = new HashMap<String,Object>();
    extraFields.put("search_type", "search1");
    boolean includeScore = 
      req.getParams().get(CommonParams.FL).contains("score");
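    // maxDocsPerSearcherType and maprelScoreCutoff are presumably handler fields
    // driven by the invariants configuration; their declaration is elided above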
    append(results, searcher.search(
      query, filter, maxDocsPerSearcherType).scoreDocs,
      alreadyFound, fields, extraFields, maprelScoreCutoff, 
      searcher.getReader(), includeScore);
  }

  // ... more doSearchXXX() methods here ...

  private void append(SolrDocumentList results, ScoreDoc[] more, 
      Set<Integer> alreadyFound, Map<String,SchemaField> fields,
      Map<String,Object> extraFields, float scoreCutoff, 
      SolrIndexReader reader, boolean includeScore) throws IOException {
    for (ScoreDoc hit : more) {
      if (alreadyFound.contains(hit.doc)) {
        continue;
      }
      Document doc = reader.document(hit.doc);
      SolrDocument sdoc = new SolrDocument();
      for (String fieldname : fields.keySet()) {
        SchemaField sf = fields.get(fieldname);
        if (sf.stored()) {
          sdoc.addField(fieldname, doc.get(fieldname));
        }
      }
      for (String extraField : extraFields.keySet()) {
        sdoc.addField(extraField, extraFields.get(extraField));
      }
      if (includeScore) {
        sdoc.addField("score", hit.score);
      }
      results.add(sdoc);
      alreadyFound.add(hit.doc);
    }
  }
  
  //////////////////////// SolrInfoMBeans methods //////////////////////

  @Override
  public String getDescription() {
    return "My Search Handler";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Configuration Parameters - I started out baking most of my configuration parameters into the handler code as constants, but later moved them into the invariants block of the XML declaration. This is not ideal, since we still need to touch the solrconfig.xml file (which is regarded as application code in our environment) to change behavior. The ideal solution, given the circumstances, would probably be to hold the configuration parameters in JNDI and have the handler pull the properties it needs from there, as sketched below.
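A minimal sketch of what that could look like (the java:comp/env entry name and the JndiConfig helper are hypothetical, not part of the actual handler):

import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Hypothetical helper: read a handler property from JNDI instead of invariants.
public class JndiConfig {

  public static String lookup(String name, String defaultValue) {
    try {
      Context env = (Context) new InitialContext().lookup("java:comp/env");
      Object value = env.lookup(name);
      return (value != null) ? value.toString() : defaultValue;
    } catch (NamingException e) {
      // entry not bound (or no JNDI context available) - fall back to the default
      return defaultValue;
    }
  }
}

// then in init(): this.prop1 = JndiConfig.lookup("mysearch/prop1", "value1");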

Using Filter - The MoreLikeThisHandler converts the fq (filter query) parameters into a List of Query objects, because that is what searcher.getDocList() expects. In my case, I couldn't use DocListAndSet because DocList is unmodifiable (i.e., DocList.add() throws an UnsupportedOperationException). So I fell back to the pattern I am used to, which is getting the ScoreDoc[] array from a standard searcher.search(Query, Filter, numDocs) call. That is why buildFilter() above returns a Filter and not a List<Query>; the alternative route is sketched below for contrast.
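Roughly, the List<Query> route looks like this (a sketch based on my reading of MoreLikeThisHandler, reusing the q, fqs, start and rows variables from handleRequestBody above - treat it as an approximation, not a drop-in):

// fq parameters become a List<Query> that is handed to the searcher along
// with the main query; the returned DocList/DocListAndSet is read-only.
List<Query> filters = new ArrayList<Query>();
if (fqs != null) {
  for (String fq : fqs) {
    if (fq != null && fq.trim().length() > 0) {
      QParser fqp = QParser.getParser(fq, null, req);
      filters.add(fqp.getQuery());
    }
  }
}
Query query = QParser.getParser(q, null, req).getQuery();
// null sort means order by score; 0 means no special flags
DocListAndSet found = searcher.getDocListAndSet(query, filters, null, start, rows, 0);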

Connect to external services - My handler needs to connect to our taxonomy service. The taxonomy exposes an RMI service with a very rich and fine-grained API. I tried to use this at first, but ran into problems because the RMI client needs access to configuration files on the local filesystem, which Jetty could not see because they were outside its context. I ended up solving this by exposing a coarse-grained JSON-over-HTTP service on the taxonomy side. The handler calls it once per query and gets back all the information it needs in a single call. This is probably not ideal, since the logic is now spread across two places - I will probably revisit the RMI client integration in the future.
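The client side of that call is nothing fancy. Here is a rough sketch (the endpoint URL, query parameter and TaxoResult.fromJson() helper are hypothetical - the real service and its payload are internal):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical sketch of the one-HTTP-call-per-query taxonomy lookup.
public class TaxoService {

  private final String baseUrl;

  public TaxoService(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  public TaxoResult getResult(String q) throws IOException {
    URL url = new URL(baseUrl + "?q=" + URLEncoder.encode(q, "UTF-8"));
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setConnectTimeout(1000);
    conn.setReadTimeout(2000);
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    StringBuilder json = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
      json.append(line);
    }
    reader.close();
    conn.disconnect();
    // parse the JSON payload into a TaxoResult (parsing code elided)
    return TaxoResult.fromJson(json.toString());
  }
}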

Layer multiple resultsets - This is the main reason for writing the custom handler. Most of the work happens in the append() method above. Each sub-handler calls SolrIndexSearcher.search(Query, Filter, numDocs) and copies its resulting ScoreDoc[] array into the shared SolrDocumentList. Since previous sub-handlers may already have returned a document, each sub-handler checks its hits against a Set of already-found docIds and skips duplicates.

Add a pseudo-field to the Document - There are currently two competing initiatives in Solr (SOLR-1566 and SOLR-1298) on how to handle this situation. Since I was already populating SolrDocument objects (this was one of the reasons I started using SolrDocumentList), it was relatively simple to pass in a Map of extra fields that are just tacked onto the end of each SolrDocument.

Some Miscellaneous Advice

Here are some tips I wish someone had told me before I started out on this.

For your own sanity, standardize on a Solr release. I chose 1.4.1, which is the latest release at the time of writing. Before that, I was developing against the Solr trunk. One day (after about 60-70% of my code was working), I did an svn update, and suddenly there was a huge bunch of compile failures (in my code as well as in the Solr code). Some of them were probably caused by missing or out-of-date JARs in my .classpath. But the point is that Solr is being actively developed, there is quite a bit of code churn, and if you really want to work on the trunk (or a pre-release branch), you should be prepared to deal with these situations.

Solr is well designed (so the flow is fairly intuitive) and reasonably well documented, but there are places where you will probably need to step through the code in a debugger to figure out what's going on. I am still using the Jetty container in the example subdirectory. This page on Lucid Imagination outlines the steps needed to run Solr within Eclipse using the Jetty plugin, but thanks to the information on this StackOverflow page, all I did was add some command-line parameters to the java call, like so:

sujit@cyclone:example$ java -Dsolr.solr.home=my_schema \
  -agentlib:jdwp=transport=dt_socket,server=y,address=8883,suspend=n \
  -jar start.jar

and then set up an external debug configuration for localhost:8883 in Eclipse, and I could step through the code just fine.

Solr has very aggressive caching (which is great in a production environment), but for development you need to disable it. I did this by commenting out the filterCache, queryResultCache and documentCache sections in solrconfig.xml, and changing the httpCaching element to use never304="true".
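For reference, the httpCaching part of that change looks roughly like this (it lives inside the requestDispatcher element of solrconfig.xml; the cache sections themselves are simply commented out):

  <requestDispatcher handleSelect="true">
    <!-- development only: tell Solr never to send 304 Not Modified responses -->
    <httpCaching never304="true" />
  </requestDispatcher>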

Conclusion

The approach I describe here is not as performant as the "standard" flow. Because I do multiple searches in a single request, I do more I/O. I also consume more CPU cycles, since I have to dedup documents across each layer, and more memory per request, because I populate SolrDocument objects inline rather than just passing a DocListAndSet to the ResponseBuilder. I don't see a way around this, though, given the nature of my requirements.

If you are a Solr expert, or someone who is familiar with the internals, I would appreciate hearing your thoughts about this approach - criticisms and suggestions are welcome.

http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html

