Solr: a custom Search RequestHandler
As you know, I've been playing with Solr lately, trying to see how feasible it would be to customize it for our needs. We have been a Lucene shop for a while, and we've built our own search framework around it, which has served us well so far. The rationale for moving to Solr is driven primarily by the need to expose our search tier as a service for our internal applications. While it would have been relatively simple (probably simpler) to slap on an HTTP interface over our current search tier, we also want to use the other Solr features such as incremental indexing and replication.
One of our challenges in using Solr is that the way we do search is quite different from the way Solr does search. A query string passed to the default Solr search handler is parsed into a Lucene query, and a single search call is made on the underlying index. In our case, the query string is passed to our taxonomy, and depending on the type of query (as identified by the taxonomy), it is sent through one or more sub-handlers. Each sub-handler converts the query into a (different) Lucene query and executes the search against the underlying index. The results from each sub-handler are then layered together to present the final search result.
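To make the flow above concrete, here is a minimal plain-Java sketch of the dispatch step: given the query type reported by the taxonomy, decide which sub-handlers run and what query each one builds. Everything in it (the QueryType values, the SubHandler class, the field names) is hypothetical and only illustrates the shape of the idea; it is not our actual taxonomy API, and it builds query strings rather than real Lucene Query objects.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

/** Sketch: pick the sub-handlers that apply to a query type (all names hypothetical). */
class SubHandlerDispatch {

  /** Hypothetical query classification, as a taxonomy lookup might report it. */
  enum QueryType { CONCEPT, KEYWORD }

  /** A sub-handler knows which query types it accepts and builds its own query. */
  static final class SubHandler {
    private final String field;
    private final EnumSet<QueryType> accepts;

    SubHandler(String field, EnumSet<QueryType> accepts) {
      this.field = field;
      this.accepts = accepts;
    }

    /** Entry condition: does this sub-handler apply to the given query type? */
    boolean canEnter(QueryType type) {
      return accepts.contains(type);
    }

    /** Builds a (different) query per sub-handler; here just a field-scoped string. */
    String buildQuery(String q) {
      return field + ":(" + q + ")";
    }
  }

  /** Returns, in layer order, the queries of every sub-handler that accepts the type. */
  static List<String> plan(String q, QueryType type, List<SubHandler> handlers) {
    List<String> queries = new ArrayList<>();
    for (SubHandler handler : handlers) {
      if (handler.canEnter(type)) {
        queries.add(handler.buildQuery(q));
      }
    }
    return queries;
  }
}
```

The order of the handler list is the layering order: results from earlier sub-handlers appear above results from later ones.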
Conceptually, the customization is quite simple: create a custom subclass of RequestHandlerBase (as advised on this wiki page) and override the handleRequestBody(SolrQueryRequest, SolrQueryResponse) method. In reality, I had quite a tough time doing this, admittedly caused (at least partly) by my ignorance of Solr internals. However, I did succeed, so in this post I outline my solution, along with some advice I feel would be useful to others embarking on a similar route.
Configuration and Code
The handler is configured to trigger in response to a /solr/mysearch request. Here is the (rewritten for readability) XML snippet from my solrconfig.xml file. I used the "invariants" block to pass in configuration parameters for the handler.
...
<requestHandler name="/mysearch"
    class="org.apache.solr.handler.ext.MyRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="fl">*,score</str>
    <str name="wt">xml</str>
  </lst>
  <lst name="invariants">
    <str name="prop1">value1</str>
    <int name="prop2">value2</int>
    <!-- ... more config items here ... -->
  </lst>
</requestHandler>
...
And here is the (also rewritten for readability) code for the custom handler. I used the SearchHandler and MoreLikeThisHandler as my templates, but diverged from them in several ways in order to accommodate my requirements. I will describe these differences below.
package org.apache.solr.handler.ext;

// imports omitted

public class MyRequestHandler extends RequestHandlerBase {

  private String prop1;
  private int prop2;
  ...
  private TaxoService taxoService;

  @Override
  public void init(NamedList args) {
    super.init(args);
    // super.init() populates "invariants" from the block in solrconfig.xml
    this.prop1 = invariants.get("prop1");
    this.prop2 = Integer.parseInt(invariants.get("prop2"));
    ...
    this.taxoService = new TaxoService(prop1);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    // extract params from request
    SolrParams params = req.getParams();
    String q = params.get(CommonParams.Q);
    String[] fqs = params.getParams(CommonParams.FQ);
    int start = 0;
    try { start = Integer.parseInt(params.get(CommonParams.START)); }
    catch (Exception e) { /* missing or malformed: keep default */ }
    int rows = 0;
    try { rows = Integer.parseInt(params.get(CommonParams.ROWS)); }
    catch (Exception e) { /* missing or malformed: keep default */ }
    SolrPluginUtils.setReturnFields(req, rsp);
    // build initial data structures
    TaxoResult taxoResult = taxoService.getResult(q);
    SolrDocumentList results = new SolrDocumentList();
    SolrIndexSearcher searcher = req.getSearcher();
    Map<String,SchemaField> fields = req.getSchema().getFields();
    int ndocs = start + rows;
    Filter filter = buildFilter(fqs, req);
    Set<Integer> alreadyFound = new HashSet<Integer>();
    // invoke the various sub-handlers in turn and return results
    doSearch1(results, searcher, q, filter, taxoResult, ndocs, req,
      fields, alreadyFound);
    doSearch2(results, searcher, q, filter, taxoResult, ndocs, req,
      fields, alreadyFound);
    // ... more sub-handler calls here ...
    // build and write response: compute numFound and maxScore over the
    // merged results, and keep only the requested page [start, start + rows)
    float maxScore = 0.0F;
    int numFound = 0;
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      // the score field is absent when the client did not ask for it in fl
      Float score = (Float) sdoc.getFieldValue("score");
      if (score != null && maxScore < score) {
        maxScore = score;
      }
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    results.clear();
    results.addAll(slice);
    results.setNumFound(numFound);
    results.setMaxScore(maxScore);
    results.setStart(start);
    rsp.add("response", results);
  }

  // ANDs all fq parameters together into a single cached Lucene Filter
  private Filter buildFilter(String[] fqs, SolrQueryRequest req)
      throws IOException, ParseException {
    if (fqs != null && fqs.length > 0) {
      BooleanQuery fquery = new BooleanQuery();
      for (int i = 0; i < fqs.length; i++) {
        QParser parser = QParser.getParser(fqs[i], null, req);
        fquery.add(parser.getQuery(), Occur.MUST);
      }
      return new CachingWrapperFilter(new QueryWrapperFilter(fquery));
    }
    return null;
  }

  private void doSearch1(SolrDocumentList results,
      SolrIndexSearcher searcher, String q, Filter filter,
      TaxoResult taxoResult, int ndocs, SolrQueryRequest req,
      Map<String,SchemaField> fields, Set<Integer> alreadyFound)
      throws IOException {
    // check entry condition
    if (! canEnterSearch1(q, filter, taxoResult)) {
      return;
    }
    // build custom query and extra fields
    Query query = buildCustomQuery1(q, taxoResult);
    Map<String,Object> extraFields = new HashMap<String,Object>();
    extraFields.put("search_type", "search1");
    String fl = req.getParams().get(CommonParams.FL);
    boolean includeScore = fl != null && fl.contains("score");
    // maxDocsPerSearcherType and maprelScoreCutoff are fields not shown
    // in this listing
    append(results, searcher.search(
      query, filter, maxDocsPerSearcherType).scoreDocs,
      alreadyFound, fields, extraFields, maprelScoreCutoff,
      searcher.getReader(), includeScore);
  }

  // ... more doSearchXXX() methods here ...

  // converts each new hit into a SolrDocument and appends it to results,
  // skipping documents already returned by an earlier sub-handler
  private void append(SolrDocumentList results, ScoreDoc[] more,
      Set<Integer> alreadyFound, Map<String,SchemaField> fields,
      Map<String,Object> extraFields, float scoreCutoff,
      SolrIndexReader reader, boolean includeScore) throws IOException {
    for (ScoreDoc hit : more) {
      if (alreadyFound.contains(hit.doc)) {
        continue;
      }
      Document doc = reader.document(hit.doc);
      SolrDocument sdoc = new SolrDocument();
      for (String fieldname : fields.keySet()) {
        SchemaField sf = fields.get(fieldname);
        if (sf.stored()) {
          sdoc.addField(fieldname, doc.get(fieldname));
        }
      }
      for (String extraField : extraFields.keySet()) {
        sdoc.addField(extraField, extraFields.get(extraField));
      }
      if (includeScore) {
        sdoc.addField("score", hit.score);
      }
      results.add(sdoc);
      alreadyFound.add(hit.doc);
    }
  }

  //////////////////////// SolrInfoMBeans methods //////////////////////

  @Override
  public String getDescription() {
    return "My Search Handler";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}
Configuration Parameters - I started out baking most of my "configuration" parameters as constants within the handler code, but later moved them into the invariants block in the XML declaration. Not ideal, since we still need to touch the solrconfig.xml file (which is regarded as application code in our environment) to change behavior. The ideal solution, given the circumstances, would probably be to use JNDI to hold the configuration parameters and have the handler connect to the JNDI to pull the properties it needs.
Using Filter - The MoreLikeThis handler converts the fq (filter query) parameter into a List of Query objects, because this is what needs to be passed into searcher.getDocList(). In my case, I couldn't use DocListAndSet because DocList is unmodifiable (i.e., DocList.add() throws an UnsupportedOperationException). So I fell back to the pattern I am used to, which is getting the ScoreDoc[] array from a standard searcher.search(Query,Filter,numDocs) call. That is why the buildFilter() above returns a Filter and not a List<Query>.
Connect to external services - My handler needs to connect to the taxonomy service. Our taxonomy exposes an RMI service with a very rich and fine-grained API. I tried to use this at first, but ran into problems because it needs access to configuration files on the local system, and Jetty couldn't see these files because they were not within its context. I ended up solving this by exposing a coarse-grained JSON service over HTTP on the taxonomy service. The handler calls it once per query and gets back all the information it needs in a single call. Probably not ideal, since the logic is now spread across two places - I will probably revisit the RMI client integration in the future.
Layer multiple resultsets - This is the main reason for writing the custom handler. Most of the work happens in the append() method above. Each sub-handler calls SolrIndexSearcher.search(Query, Filter, numDocs) and converts the resulting ScoreDoc[] array into SolrDocument objects, which are appended to a shared SolrDocumentList. Since previous sub-handlers may have already returned a given document, subsequent sub-handlers check each hit against a Set of docIds and skip duplicates.
Add a pseudo-field to the Document - There are currently two competing initiatives in Solr (SOLR-1566 and SOLR-1298) on how to handle this situation. Since I was populating SolrDocument objects (this was one of the reasons I started using SolrDocumentList), it was relatively simple for me to pass in a Map of extra fields which are just tacked on to the end of the SolrDocument.
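The layering, dedup and pseudo-field ideas from the notes above can be seen in isolation in the following plain-Java sketch. It deliberately avoids Solr and Lucene classes so it can be run standalone: the Hit class is a hypothetical stand-in for Lucene's ScoreDoc, and a Map stands in for SolrDocument. It mirrors the structure of append() above under those assumptions, and is not a drop-in replacement for it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: layer hits from several sub-searches, skipping duplicates (no Solr classes). */
class LayeredResults {

  /** Hypothetical stand-in for Lucene's ScoreDoc: a doc id plus a score. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  // merged documents, in layer order; each Map stands in for a SolrDocument
  private final List<Map<String, Object>> docs = new ArrayList<>();
  // doc ids already returned by an earlier layer
  private final Set<Integer> alreadyFound = new HashSet<>();

  /** Appends hits not seen in an earlier layer, tagging each with the extra fields. */
  void append(List<Hit> hits, Map<String, Object> extraFields) {
    for (Hit hit : hits) {
      if (!alreadyFound.add(hit.doc)) {
        continue; // an earlier layer already returned this document
      }
      Map<String, Object> doc = new LinkedHashMap<>();
      doc.put("id", hit.doc);
      doc.put("score", hit.score);
      doc.putAll(extraFields); // pseudo-fields such as search_type
      docs.add(doc);
    }
  }

  List<Map<String, Object>> docs() {
    return docs;
  }
}
```

Because a document keeps the fields assigned by the first layer that found it, the order in which sub-handlers are called determines both its position and its search_type value.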
Some Miscellaneous Advice
Here are some tips that I wish someone had given me before I started out on this.
For your own sanity, standardize on a Solr release. I chose 1.4.1, which is the latest at the time of writing. Prior to that, I was developing within the Solr trunk. One day (after about 60-70% of my code was working), I decided to do an svn update, and all of a sudden there was a huge bunch of compile failures (in my code as well as the Solr code). Some of them were probably caused by missing or out-of-date JARs in my .classpath. But the point is that Solr is being actively developed, with quite a bit of code churn, and if you really want to work on the trunk (or a pre-release branch), you should be ready to deal with these situations.
Solr is well designed (so the flow is fairly intuitive) and reasonably well documented, but there are some places where you will probably need to step through the code in a debugger to figure out what's going on. I am still using the Jetty container in the example subdirectory. This page on Lucid Imagination outlines the steps you need to run Solr within Eclipse using the Jetty plugin, but thanks to the information on this StackOverflow page, all I did was add some command-line parameters to the java call, like so:
sujit@cyclone:example$ java -Dsolr.solr.home=my_schema \
-agentlib:jdwp=transport=dt_socket,server=y,address=8883,suspend=n \
-jar start.jar
and then set up an external debug configuration for localhost:8883 in Eclipse, and I could step through the code just fine.
Solr has very aggressive caching (which is great for a production environment), but for development you need to disable it. I did this in solrconfig.xml by commenting out the filterCache, queryResultCache and documentCache declarations, and by changing the httpCaching element to use never304="true".
Conclusion
The approach I described here is not as performant as the "standard" flow. Because I have to do multiple searches in a single request, I am doing more I/O. I am also consuming more CPU cycles since I have to dedup documents across each layer. I am also consuming more memory per request because I populate the SolrDocument inline rather than just pass the DocListAndSet to the ResponseBuilder. I don't see a way around it, though, given the nature of my requirements.
If you are a Solr expert, or someone who is familiar with the internals, I would appreciate hearing your thoughts about this approach - criticisms and suggestions are welcome.
http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html
Reposted from: http://www.cnblogs.com/chenying99/p/3470393