Custom SOLR Search Components - 2 Dev Tricks

I've been building some custom search components for SOLR lately, so I wanted to share a couple of things I learned in the process. Most likely this is old hat to people who have been doing this for a while, but I thought I'd share anyway, just in case it benefits someone...

Passing State

In a previous post, I described a custom SOLR search handler that returns layered search results for a given query term (and optional filters). As I went further, though, I realized that I needed to return information about facets and category clusters as well. Of course, I could have added this logic to the handler itself, but splitting it across a chain of search components seemed preferable, readability- and reusability-wise, so I went that route.

So the first step was to refactor my custom SearchHandler into a SearchComponent. Not much to do there, except to subclass SearchComponent instead of RequestHandlerBase and move the handleRequestBody(SolrQueryRequest, SolrQueryResponse) logic into a process(ResponseBuilder) method. The request and response objects are accessible from the ResponseBuilder as properties, i.e., ResponseBuilder.req and ResponseBuilder.rsp. I then declared this component and an enclosing handler in solrconfig.xml, something like this:

  <!-- this used to be my search handler -->
  <searchComponent name="component1"
      class="org.apache.solr.handler.component.ext.MyComponent1">
    <str name="prop1">value1</str>
    <str name="prop2">value2</str>
  </searchComponent>
  <searchComponent name="component2" 
      class="org.apache.solr.handler.component.ext.MyComponent2">
    <lst name="facets">
      <str name="prop1">1</str>
      <str name="prop2">2</str>
    </lst>
  </searchComponent>
  <requestHandler name="/mysearch2" 
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score,id</str>
      <str name="wt">xml</str>
    </lst>
    <arr name="components">
      <str>component1</str>
      <str>component2</str>
      <!-- ... more components as needed ... -->
    </arr>
  </requestHandler>

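After the refactoring, the component skeleton ends up looking something like this (a bare-bones sketch; the class name matches the config above, the elided logic is whatever handleRequestBody() used to do, and the SolrInfoMBean methods at the bottom are just boilerplate required by SearchComponent):

import java.io.IOException;

import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class MyComponent1 extends SearchComponent {

  @Override
  public void init(NamedList args) {
    // read prop1/prop2 from the <searchComponent> configuration above
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // per-request setup; runs for all components before any process() call
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // the old handleRequestBody() arguments are properties on the builder
    SolrQueryRequest req = rb.req;
    SolrQueryResponse rsp = rb.rsp;
    // ... former handleRequestBody() logic goes here ...
  }

  // SolrInfoMBean boilerplate required by SearchComponent
  @Override
  public String getDescription() { return "my layered search component"; }

  @Override
  public String getSource() { return "$URL$"; }

  @Override
  public String getSourceId() { return "$Id$"; }

  @Override
  public String getVersion() { return "1.0"; }
}

Everything interesting happens in process(); the rest is plumbing.
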
I've also added a second component to the chain above (just so I don't have to show this snippet again later); hope it's not too confusing. Obviously there can be multiple components before and after my search-handler-turned-search-component, but for the purposes of this discussion I'll keep things simple and concentrate on this one other component, pretending that it has multiple unique (and pertinent) requirements.

Now, assume that the second component needs data that is already available, or can easily be generated, by component1. This is actually true in my case: I needed a BitSet of document IDs in the search results in my second component, which I could easily build by collecting them while looping through the SolrDocumentList of results in my first component, so it seemed wasteful to compute this again. So I updated this snippet of code in component1's process() method (what used to be my handleRequestBody() method):

  public void process(ResponseBuilder rb) throws IOException {
    ...
    // build and write response
    ...
    OpenBitSet bits = new OpenBitSet(searcher.maxDoc());
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      ...
      // record the id of every matching document for downstream components
      bits.set(Long.valueOf((Integer) sdoc.get("id")));
      // keep only the requested page of results
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    ...
    rsp.add("response", results);
    // temporary data for downstream components; removed again in component2
    rsp.add("_bits", bits);
  }

In my next component (component2), I simply grab the OpenBitSet data structure by name from the NamedList, use it to generate the result for this component, stick the result back into the response, and discard the temporary data. That last step is so that the data does not appear in the response XML (for both aesthetic and performance reasons).

  public void process(ResponseBuilder rb) throws IOException {
    Map<String,Object> cres = new HashMap<String,Object>();
    NamedList nl = rb.rsp.getValues();
    OpenBitSet bits = (OpenBitSet) nl.get("_bits");
    if (bits == null) {
      logger.warn("Component 1 must write _bits into response");
      rb.rsp.add(COMPONENT_NAME, cres); // return an empty result
      return;
    }
    // do something with bits and generate component response
    doSomething(bits, cres);
    // stick the result into the response and delete the temporary data
    rb.rsp.add(COMPONENT_NAME, cres); // COMPONENT_NAME = "component2_result"
    rb.rsp.getValues().remove("_bits");
  }

Before I did this, I investigated whether I could subclass the XmlResponseWriter to ignore NamedList entries with "hidden" names (i.e., names prefixed with an underscore), but the XmlResponseWriter delegates the actual XML generation to XMLWriter, and XMLWriter is final (at least in SOLR 1.4.1). Good thing too, since it forced me to look for, and find, a simpler solution :-).

So there you have it - a simple way to pass data between components in a SOLR search RequestHandler. Note that it does mean that component2 always depends on having component1 (or some other component that produces the same data) upstream of it, so these components are no longer truly reusable pieces of code. But the pattern can be useful if you really need it and you document the requirement (or complain about it when it's not met, as I've done here).

Reacting to a COMMIT

The second thing I needed to do in component2 was to give it some reference data that it would need to compute its results. The reference data is generated from the contents of the index, and the generation is fairly heavyweight, so you don't want to do this on every request.

Now one of the cool things about SOLR is its built-in incremental indexing feature (one of the main reasons we considered using SOLR in the first place), so you can POST data to a running SOLR instance followed by a COMMIT, and voila: your searcher re-opens with the new data.

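With the SolrJ client, the post-then-commit cycle looks something like this (a minimal sketch; the URL and field values are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalIndexer {
  public static void main(String[] args) throws Exception {
    // point at the running SOLR instance (placeholder URL)
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // add (or update) a document in the live index
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "12345");
    doc.addField("title", "some new document");
    server.add(doc);
    // the COMMIT triggers the searcher reopen discussed below
    server.commit();
  }
}

That COMMIT-triggered searcher reopen is exactly the event the rest of this section hooks into.
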
Of course, this also means that if we want to provide accurate information, the reference data should be regenerated whenever the searcher is reopened. The way I went about doing this is mostly derived from how the SpellCheckComponent regenerates its dictionaries -- by hooking into the SOLR event framework.

To do this, my component2 implements SolrCoreAware in addition to extending SearchComponent. This requires me to implement the inform(SolrCore) method, which SOLR invokes after init(NamedList) but before prepare(ResponseBuilder) and process(ResponseBuilder). In the inform(SolrCore) method, I register a listener for the firstSearcher and newSearcher events (described in more detail here).

I then build the inner listener class, which implements SolrEventListener; this requires me to provide implementations for the newSearcher() and postCommit() methods. Since my listener is a query-side listener, I provide an empty implementation for postCommit(). The newSearcher() method contains the code to generate the reference data. Here is the relevant snippet of code from the component.

public class MyComponent2 extends SearchComponent implements SolrCoreAware {

  private RefData refdata = new RefData(); // needs to be regenerated on COMMIT
  private MyComponent2Listener listener;

  @Override
  public void init(NamedList args) {
    ...
  }

  @Override
  public void inform(SolrCore core) {
    // called once, after init() but before any prepare()/process()
    listener = new MyComponent2Listener();
    core.registerFirstSearcherListener(listener);
    core.registerNewSearcherListener(listener);
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    ...
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    ...
    // do something with refdata
    ...
  }

  private class MyComponent2Listener implements SolrEventListener {

    @Override
    public void init(NamedList args) { /* NOOP */ }

    @Override
    public void newSearcher(SolrIndexSearcher newSearcher,
        SolrIndexSearcher currentSearcher) {
      // build the new reference data off to the side, then repopulate,
      // so requests keep using the old data while the copy is built
      RefData copy = generateRefData(newSearcher);
      refdata.clear();
      refdata.addAll(copy);
    }

    @Override
    public void postCommit() { /* NOOP */ }
  }
  ...
}

Notice that I have registered the listener to listen on both firstSearcher and newSearcher events. This way, it gets called on SOLR startup (reacting to a firstSearcher event), and again each time the searcher is reopened (reacting to a newSearcher event).

One other thing... since the generation of RefData takes some time, it's best to have the listener's newSearcher() method build a copy and then repopulate the refdata variable from the copy; that way the component continues to use the old data until the new data is available.
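
If RefData isn't safe for concurrent clear()/addAll() while request threads are reading it, another way to get the same effect is to build the new object completely and then swap a single volatile reference (a sketch only; RefData and its heavyweight generation are assumed from the component above):

import org.apache.solr.search.SolrIndexSearcher;

public abstract class RefDataHolder {

  // volatile so request threads always see a fully built object
  private volatile RefData current = new RefData();

  // called from the listener's newSearcher(): build off to the side,
  // then swap the reference atomically
  public void rebuild(SolrIndexSearcher newSearcher) {
    current = generateRefData(newSearcher); // heavyweight generation
  }

  // called from process(): returns either the complete old data or the
  // complete new data, never an in-between state
  public RefData get() {
    return current;
  }

  // same heavyweight generation as in the component above
  protected abstract RefData generateRefData(SolrIndexSearcher searcher);
}

The clear()/addAll() approach in the snippet above works too, but it briefly exposes an empty or half-filled RefData to concurrent requests; swapping a reference avoids that window.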

And that's pretty much it for today. Till next time.

http://sujitpal.blogspot.com/2011/04/custom-solr-search-components-2-dev.html

