Re: Using another way instead of DIH
Hi, it simply means the configuration file of your DIH. Cheers

On 26 April 2013 03:37, xiaoqi belivexia...@gmail.com wrote:
Thanks for the help. data-config.xml? I can not find this file; do you mean data-import.xml or solrconfig.xml?
-- View this message in context: http://lucene.472066.n3.nabble.com/Using-another-way-instead-of-DIH-tp4058937p4059067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr metrics in Codahale metrics and Graphite?
Alan, Shawn,

If backporting to 3.x is hard, no worries; we don't necessarily require the patch, as we are heading to 4.x eventually. It is just much easier within our organization to test on the existing Solr 3.4, as there are a few internal dependencies and custom code on top of Solr. Also, Solr upgrades on production systems usually start a month or so after the upgrade on development systems (they require lots of testing and verification). Nevertheless, it is a good effort to make #solr #graphite friendly, so keep it up! :)

Dmitry

On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote:
On 4/25/2013 6:30 AM, Dmitry Kan wrote: We are very much interested in 3.4. On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk wrote: This is on top of trunk at the moment, but would be backported to 4.4 if there was interest.

This will be bad news, I'm sorry: All remaining work on 3.x versions happens in the 3.6 branch. This branch is in maintenance mode. It will only get fixes for serious bugs with no workaround. Improvements and new features won't be considered at all. You're welcome to try backporting patches from newer issues. Due to the major differences in the 3x and 4x codebases, the best-case scenario is that you'll be facing a very manual task. Some changes can't be backported because they rely on other features only found in 4.x code.

Thanks, Shawn
Re: How do I set compression on stored fields in Solr 4.2.1
Why don't we add a parameter to allow non-programmers to change it? Compression=FAST|etc

On Thursday, April 25, 2013, Chris Hostetter wrote:

: Subject: How do set compression for compression on stored fields in SOLR 4.2.1
: https://issues.apache.org/jira/browse/LUCENE-4226
: It mentions that we can set compression mode:
: FAST, HIGH_COMPRESSION, FAST_UNCOMPRESSION.

The compression details are hardcoded into the various codecs. If you wanted to customize this, you'd need to write your own codec subclass...

https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/compressing/class-use/CompressionMode.html

See, for example, the implementations of Lucene41StoredFieldsFormat and Lucene42TermVectorsFormat...

public final class Lucene41StoredFieldsFormat extends CompressingStoredFieldsFormat {
  /** Sole constructor. */
  public Lucene41StoredFieldsFormat() {
    super("Lucene41StoredFields", CompressionMode.FAST, 1 << 14);
  }
}

public final class Lucene42TermVectorsFormat extends CompressingTermVectorsFormat {
  /** Sole constructor. */
  public Lucene42TermVectorsFormat() {
    super("Lucene41StoredFields", "", CompressionMode.FAST, 1 << 12);
  }
}

-Hoss

-- Bill Bell billnb...@gmail.com cell 720-256-8076
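A note on wiring this in (not from the thread; the class name below is hypothetical): Solr 4.x selects codecs through a CodecFactory configured in solrconfig.xml, so a custom format subclass along the lines Hoss shows would be exposed through your own factory:

```xml
<!-- solrconfig.xml: swap the default codec factory for a custom
     CodecFactory whose codec returns a StoredFieldsFormat built with
     CompressionMode.HIGH_COMPRESSION instead of FAST.
     com.example.HighCompressionCodecFactory is a hypothetical class
     you would have to write and put on Solr's classpath. -->
<codecFactory class="com.example.HighCompressionCodecFactory"/>
```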
Pros and Cons of Starting a Handler Lazily?
I will use SolrCloud and its main purpose will be rich document indexing. The Solr example includes this definition:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" ...>

which starts the handler up lazily. So what are the pros and cons of removing it for my situation?
Lucene native facets
Since facets are now included in Lucene, why don't we add a pass-through from Solr? The current facet code can live on, but we could create a new param like facet.lucene=true? Seems like a great enhancement! -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: what is the maximum XML file size to import?
Thanks to all for your suggestions. -- View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-maximum-XML-file-size-to-import-tp4058263p4059113.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Indexing Rich Documents
I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can directly use the example jar of Solr. However, for a real-time environment, what should I watch out for? Also, how do you send such documents into Solr for indexing? I think post.jar does not handle those file types. I should mention that I don't store the documents in a database.
Document is missing mandatory uniqueKey field: id for Solr PDF indexing
I use Solr 4.2.1 and these are my fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<!-- Common metadata fields, named specifically to match up with SolrCell metadata when
     parsing rich documents such as Word, PDF. Some fields are multiValued only because
     Tika currently may return multiple values for them. Some metadata is parsed from the
     documents, but there are some which come from the client context:
       content_type: From the HTTP headers of incoming stream
       resourcename: From SolrCell request param resource.name -->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell. NOTE: This field is not indexed by
     default, since it is also copied to "text" using copyField below. This is to save
     space. Use this field for returning and highlighting document content. Use the
     "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<!-- catchall field, containing all other searchable text fields
     (implemented via copyField further on in this schema) -->
<!-- <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse for
     efficient leading wildcard queries. -->
<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or group
     results by manufacturer. copied from "manu" via copyField -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

I run that command:

java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf

However I get that error, any ideas?

Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
    at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id.

On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote:
I use Solr 4.2.1 and these are my fields: [...] I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas? [...] SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id [...]
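For readers hitting the same error: the post.jar help that Raymond mentions shows passing the missing uniqueKey as a literal.* request parameter. A hedged sketch of the invocation (the id value is arbitrary, and the -Dparams behavior should be checked against `java -jar post.jar -help` for your Solr version):

```
# Supply the mandatory uniqueKey via SolrCell's literal.id parameter;
# post.jar appends -Dparams values to the update URL as a query string.
java -Durl=http://localhost:8983/solr/update/extract \
     -Dparams=literal.id=523387 \
     -jar post.jar 523387.pdf
```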
Re: Using another way instead of DIH
below is my data-import.xml, any suggestions?

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://*:3306/guang" user="guang" password="guang"/>
  <document name="products">
    <entity name="item" pk="id"
            query="SELECT a.*,d.* FROM item a LEFT JOIN item_ctr_by_area d ON a.id=d.item_id LEFT JOIN shop b ON a.shop_id = b.id WHERE a.status =1 AND b.status = 1 AND b.uctrac_status =0 AND uctrac_adgroup_id IS NOT NULL">
      <field column="id" name="item_id"/>
      <field column="title" name="item_title"/>
      <field column="description" name="item_description"/>
      <field column="price" name="item_price"/>
      <field column="promotion" name="item_promotion"/>
      <field column="pic_url" name="item_picurl"/>
      <field column="local_pic_url" name="item_local_picurl"/>
      <field column="detail_url" name="item_detailurl"/>
      <field column="recommend_value" name="item_recommend_value"/>
      <field column="uctrac_adgroup_id" name="uctrac_adgroup_id"/>
      <field column="uctrac_price" name="uctrac_adgroup_price"/>
      <field column="uctrac_status" name="uctrac_adgroup_status"/>
      <field column="uctrac_creative_id" name="uctrac_creative_id"/>
      <field column="lctr" name="item_lctr"/>
      <field column="CTR_ALL" name="region_ctr_all"/>
      <field column="CTR_N" name="region_ctr_n"/>
      <field column="CTR_MN" name="region_ctr_mn"/>
      <field column="CTR_MS" name="region_ctr_ms"/>
      <field column="CTR_S" name="region_ctr_s"/>
      <field column="CTR_011100" name="region_ctr_0111"/>
      <field column="CTR_011300" name="region_ctr_0113"/>
      <field column="CTR_012100" name="region_ctr_0121"/>
      <field column="CTR_013100" name="region_ctr_0131"/>
      <field column="CTR_013200" name="region_ctr_0132"/>
      <field column="CTR_013300" name="region_ctr_0133"/>
      <field column="CTR_013400" name="region_ctr_0134"/>
      <field column="CTR_013500" name="region_ctr_0135"/>
      <field column="CTR_013700" name="region_ctr_0137"/>
      <field column="CTR_014100" name="region_ctr_0141"/>
      <field column="CTR_014200" name="region_ctr_0142"/>
      <field column="CTR_014300" name="region_ctr_0143"/>
      <field column="CTR_014400" name="region_ctr_0144"/>
      <field column="CTR_015100" name="region_ctr_0151"/>
      <field column="CTR_016100" name="region_ctr_0161"/>
      <field column="CTR_ALL_2" name="region_ctr_all_2"/>
      <field column="CTR_N_2" name="region_ctr_n_2"/>
      <field column="CTR_MN_2" name="region_ctr_mn_2"/>
      <field column="CTR_MS_2" name="region_ctr_ms_2"/>
      <field column="CTR_S_2" name="region_ctr_s_2"/>
      <field column="CTR_011100_2" name="region_ctr_0111_2"/>
      <field column="CTR_011300_2" name="region_ctr_0113_2"/>
      <field column="CTR_012100_2" name="region_ctr_0121_2"/>
      <field column="CTR_013100_2" name="region_ctr_0131_2"/>
      <field column="CTR_013200_2" name="region_ctr_0132_2"/>
      <field column="CTR_013300_2" name="region_ctr_0133_2"/>
      <field column="CTR_013400_2" name="region_ctr_0134_2"/>
      <field column="CTR_013500_2" name="region_ctr_0135_2"/>
      <field column="CTR_013700_2" name="region_ctr_0137_2"/>
      <field column="CTR_014100_2" name="region_ctr_0141_2"/>
      <field column="CTR_014200_2" name="region_ctr_0142_2"/>
      <field column="CTR_014300_2" name="region_ctr_0143_2"/>
      <field column="CTR_014400_2" name="region_ctr_0144_2"/>
      <field column="CTR_015100_2" name="region_ctr_0151_2"/>
      <field column="CTR_016100_2" name="region_ctr_0161_2"/>
      <field column="CTR_ALL_4" name="region_ctr_all_4"/>
      <field column="CTR_N_4" name="region_ctr_n_4"/>
      <field column="CTR_MN_4" name="region_ctr_mn_4"/>
      <field column="CTR_MS_4" name="region_ctr_ms_4"/>
      <field column="CTR_S_4" name="region_ctr_s_4"/>
      <field column="CTR_011100_4" name="region_ctr_0111_4"/>
      <field column="CTR_011300_4" name="region_ctr_0113_4"/>
      <field column="CTR_012100_4" name="region_ctr_0121_4"/>
      <field column="CTR_013100_4" name="region_ctr_0131_4"/>
      <field column="CTR_013200_4" name="region_ctr_0132_4"/>
      <field column="CTR_013300_4" name="region_ctr_0133_4"/>
      <field column="CTR_013400_4" name="region_ctr_0134_4"/>
      <field column="CTR_013500_4" name="region_ctr_0135_4"/>
      <field column="CTR_013700_4" name="region_ctr_0137_4"/>
      <field column="CTR_014100_4" name="region_ctr_0141_4"/>
      <field column="CTR_014200_4" name="region_ctr_0142_4"/>
      <field column="CTR_014300_4" name="region_ctr_0143_4"/>
      <field column="CTR_014400_4" name="region_ctr_0144_4"/>
      <field column="CTR_015100_4" name="region_ctr_0151_4"/>
      <field column="CTR_016100_4" name="region_ctr_0161_4"/>
      <field column="votescore"
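Since the thread is about replacing DIH: one common alternative is to run the SQL yourself and post the resulting documents to Solr's update endpoint. A minimal Python sketch of the mapping step, assuming rows arrive as dicts from whatever DB driver you use (the mapping covers only a few of the columns from the config above, and actually posting the docs to Solr is left out):

```python
# Map DB rows to Solr documents using the same column -> field renaming
# that the DIH config expresses with <field column="..." name="..."/>.
COLUMN_TO_FIELD = {
    "id": "item_id",
    "title": "item_title",
    "description": "item_description",
    "price": "item_price",
    "lctr": "item_lctr",
}

def row_to_solr_doc(row):
    """Rename columns per the mapping; unmapped columns are dropped."""
    return {field: row[col] for col, field in COLUMN_TO_FIELD.items() if col in row}

row = {"id": 42, "title": "example item", "price": 9.99, "extra": "ignored"}
doc = row_to_solr_doc(row)
# doc == {"item_id": 42, "item_title": "example item", "item_price": 9.99}
```

The resulting dicts can then be serialized with json.dumps and sent to /solr/update in batches, which gives you full control over scheduling and error handling that DIH hides.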
Re: [solr 3.4] anomaly during distributed facet query with 102 shards
Hi,

1. Ruled out the possibility of testing a 4.2.1 router against the 3.4 shard farm, for obvious reasons (java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format).
2. Tried jetty, but same result.

On Thu, Apr 25, 2013 at 5:16 PM, Dmitry Kan solrexp...@gmail.com wrote: Thanks, Yonik. Yes, I supposed that. We are in the pre-release phase, so we have the pressure. Solr 3.4. Would setting up a 4.2.1 router work with 3.4 shards? On 25 Apr 2013 17:11, Yonik Seeley yo...@lucidworks.com wrote: On Thu, Apr 25, 2013 at 8:32 AM, Dmitry Kan solrexp...@gmail.com wrote: Are there any distrib facet gurus on the list? I would be ready to try sensible ideas, including on the source code level, if someone of you could give me a hand. The Lucene/Solr Revolution conference is coming up next week, so I think many are busy creating their presentations. What version of Solr are you using? Have you tried using a newer version? Is it reproducible with a smaller cluster? If so, you could try using the included Jetty server instead of Tomcat to rule out that factor. -Yonik http://lucidworks.com
Log Monitoring System for SolrCloud, and Logging to log4j in SolrCloud?
I want to use GrayLog2 to monitor my logging files for SolrCloud. However, I think that GrayLog2 works with log4j and logback, while Solr uses slf4j. How can I solve this problem, and what log monitoring system do folks use?
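For context (not from the thread): slf4j is only a logging facade, so Solr's log output can be routed to log4j by putting the slf4j-log4j12 binding jar (plus log4j itself) on the webapp classpath in place of the default binding. At that point a standard log4j.properties controls the output; a minimal sketch (appender name and file path are hypothetical) that writes a rolling file a log shipper could pick up:

```
# log4j.properties -- route everything to a rolling file
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c: %m%n
```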
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Hi Raymond;

Now I get that error:

SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException:

2013/4/26 Raymond Wiker rwi...@gmail.com:
You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id. [...]
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
http://wiki.apache.org/solr/post.jar

-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 26 Apr 2013, at 13:28, Furkan KAMACI furkankam...@gmail.com wrote:
Hi Raymond; Now I get that error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: [...]
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
If you can help me it would be nice. I get this error:

SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/extract..
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 523387.pdf (application/pdf)
SimplePostTool: WARNING: Solr returned an error #404 Not Found
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf&literal.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update/extract..
Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket'
Time spent: 0:00:00.194

and there is nothing indexed. Here is my server log:

Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c]
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d]
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 13[segments_d]
Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@37342445 main
Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@37342445 main{StandardDirectoryReader(segments_2:1:nrt)}
Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [collection1] Registered new searcher Searcher@37342445 main{StandardDirectoryReader(segments_2:1:nrt)}
Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156

2013/4/26 Jan Høydahl jan@cominvent.com:
http://wiki.apache.org/solr/post.jar [...]
Re: Solr Indexing Rich Documents
It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I take care of? Also, how do you send such documents into Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
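For reference, the request that ends up hitting the extracting handler can be sketched in a few lines of Python; the helper name, base URL, and example id are illustrative, not part of Solr or post.jar:

```python
from urllib.parse import urlencode

def build_extract_url(base_url, file_path, doc_id):
    """Build a SolrCell (/update/extract) request URL for one rich document.

    literal.id supplies the required uniqueKey value; resource.name lets
    Tika use the file name when guessing the content type.
    """
    params = urlencode({
        "literal.id": doc_id,
        "resource.name": file_path,
        "commit": "true",
    })
    return "%s/update/extract?%s" % (base_url.rstrip("/"), params)

url = build_extract_url("http://localhost:8983/solr", "/docs/523387.pdf", "523387")
print(url)
```

The same parameters appear in the FileNotFoundException URLs quoted elsewhere in this digest, which is what makes the doubled /extract path there easy to spot.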
Re: Lucene native facets
Sure, but they are completely different conceptual models of faceting - Solr is dynamic, based on the actual data for the hierarchy, while Lucene is static, based on a predefined taxonomy that must be meticulously created before any data is added. Solr answers the question "What structure does your data have?", while Lucene answers the question "How does your data fit into a predefined structure?" Both are valid and valuable questions, but they are still rather distinct. Yes, Solr should provide support for static facet taxonomies, but what exactly that would look like... has not even been proposed yet, let alone as simple as facet.lucene=true. OTOH, maybe most of the work may be simply to add taxonomy management to Solr (as a passthrough to the Lucene features), and then maybe a lot of the existing Solr facet parameters simply need parallel Lucene-oriented implementations. But, the other half of Solr facets is how filter queries are used for selecting facets. That's all done at the application level, so it can't be hidden from the app so easily. Maybe a new Solr facet filter API can be developed that can then in turn have Solr-facet vs. Lucene-facet implementations. Or, maybe a new dynamic-facet Lucene API could be added as well, so that Solr facets in fact become a passthrough as well. Still, it would be good to support Lucene facets in Solr. Maybe that could be one of the key turning points for what defines Lucene/Solr 5.0. Is there a Jira for this? I don't recall one. -- Jack Krupansky -Original Message- From: William Bell Sent: Friday, April 26, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Lucene native facets Since facets are now included in Lucene, why don't we add a pass through from Solr? The current facet code can live on but we could create a new param like facet.lucene=true? Seems like a great enhancement! -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Prons an Cons of Startup Lazy a Handler?
Lazy startup simply means that you are willing to tolerate a slight delay on the first request to that request handler. It also has the side effect that if there are any problems with starting up the handler, they won't be seen until that first request. In short, whether you want to keep the handler is completely independent of the lazy startup option. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Prons an Cons of Startup Lazy a Handler? I will use SolrCloud and its main purpose will be rich document indexing. The Solr example includes this definition: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler"> It starts the handler up lazily. So what are the pros and cons of removing that option for my situation?
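For reference, the handler definition under discussion, roughly as it appears in the example solrconfig.xml (the defaults shown here are illustrative, not required for lazy startup):

```xml
<!-- startup="lazy" defers initialization until the first /update/extract
     request; drop the attribute to pay the cost (and surface any startup
     errors) at core load time instead. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```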
Re: Solr Indexing Rich Documents
Thanks for the answer. I get an error now: a FileNotFoundException, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I take care of? Also, how do you send such documents into Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
I think that I should start a new thread for my question to help people who searches for same situation. 2013/4/26 Furkan KAMACI furkankam...@gmail.com If you can help me it would be nice. I get that error: SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/update/extract.. Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log POSTing file 523387.pdf (application/pdf) SimplePostTool: WARNING: Solr returned an error #404 Not Found SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdfliteral.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf 1 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/update/extract .. Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket' Time spent: 0:00:00.194 and there is nothing indexed. 
Here is my server log: Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c] commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 13[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init INFO: Opening Searcher@37342445 main Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener sending requests to Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener done. 
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher INFO: [collection1] Registered new searcher Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156 2013/4/26 Jan Høydahl jan@cominvent.com http://wiki.apache.org/solr/post.jar -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 26. apr. 2013 kl. 13:28 skrev Furkan KAMACI furkankam...@gmail.com: Hi Raymond; Now I get that error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: 2013/4/26 Raymond Wiker rwi...@gmail.com You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id. On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote: I use Solr 4.2.1 and these are my fields: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=text type=text_general indexed=true stored=true/ !-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich documents such as Word, PDF. Some fields are multiValued only because Tika currently may return multiple values for them. 
Some metadata is parsed from the documents, but there are some which come from the client context: content_type: From the HTTP headers of incoming stream resourcename: From SolrCell request param resource.name -- field name=title type=text_general indexed=true stored=true multiValued=true/ field name=subject type=text_general indexed=true stored=true/ field name=description type=text_general indexed=true stored=true/ field name=comments type=text_general indexed=true stored=true/ field name=author type=text_general indexed=true stored=true/ field name=keywords type=text_general indexed=true stored=true/ field name=category type=text_general indexed=true stored=true/ field
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
Could anybody help me for my error. When I try to post documents with post.jar I get that error: SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/update/extract.. Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log POSTing file 523387.pdf (application/pdf) SimplePostTool: WARNING: Solr returned an error #404 Not Found *SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException:* http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdfliteral.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf 1 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/update/extract.. Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket' Time spent: 0:00:00.194 and there is nothing indexed. 
Here is my server log: Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c] commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 13[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init INFO: Opening Searcher@37342445 main Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener sending requests to Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener done. 
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher INFO: [collection1] Registered new searcher Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156 I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Maybe you are confusing things by mixing instructions - there are SEPARATE instructions for directly using SolrCell and implicitly using it via post.jar. Pick which you want and stick with it. DO NOT MIX the instructions. You wrote: I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf Was there a GOOD reason that you chose that URL? Best to stay with what the post.jar wiki recommends: Post all CSV, XML, JSON and PDF documents using AUTO mode which detects type based on file name: java -Dauto -jar post.jar *.csv *.xml *.json *.pdf Or, stick with SolrCell directly, but follow its distinct instructions: http://wiki.apache.org/solr/ExtractingRequestHandler Again, DO NOT MIX the instructions from the two. post.jar is designed so that you do not need to know or care exactly how rich document indexing works. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<!-- Common metadata fields, named specifically to match up with SolrCell
     metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       content_type: From the HTTP headers of incoming stream
       resourcename: From SolrCell request param resource.name -->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
     NOTE: This field is not indexed by default, since it is also copied to
     "text" using copyField below. This is to save space. Use this field for
     returning and highlighting document content. Use the "text" field to
     search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<!-- catchall field, containing all other searchable text fields
     (implemented via copyField further on in this schema) -->
<!-- <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse
     for efficient leading wildcard queries. -->
<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or
     group results by manufacturer. copied from "manu" via copyField -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas?

Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
    at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Jack, thanks for your answers. Ok, when I remove -Durl parameter I think it works, thanks. However I think that I have a problem with my schema. I get that error: Apr 26, 2013 3:52:21 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/exampledocs/523387.pdf] multiple values encountered for non multiValued copy field text: application/pdf 2013/4/26 Jack Krupansky j...@basetechnology.com Maybe you are confusing things by mixing instructions - there are SEPARATE instructions for directly using SolrCell and implicitly using it via post.jar. Pick which you want and stick with it. DO NOT MIX the instructions. You wrote: I run that command: java -Durl= http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf Was there a GOOD reason that you chose that URL? Best to stay with what the post.jar wiki recommends: Post all CSV, XML, JSON and PDF documents using AUTO mode which detects type based on file name: java -Dauto -jar post.jar *.csv *.xml *.json *.pdf Or, stick with SolrCell directly, but follow its distinct instructions: http://wiki.apache.org/solr/ExtractingRequestHandler Again, DO NOT MIX the instructions from the two. post.jar is designed so that you do not need to know or care exactly how rich document indexing works. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=text type=text_general indexed=true stored=true/ !-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich documents such as Word, PDF. Some fields are multiValued only because Tika currently may return multiple values for them. 
Some metadata is parsed from the documents, but there are some which come from the client context: content_type: From the HTTP headers of incoming stream resourcename: From SolrCell request param resource.name -- field name=title type=text_general indexed=true stored=true multiValued=true/ field name=subject type=text_general indexed=true stored=true/ field name=description type=text_general indexed=true stored=true/ field name=comments type=text_general indexed=true stored=true/ field name=author type=text_general indexed=true stored=true/ field name=keywords type=text_general indexed=true stored=true/ field name=category type=text_general indexed=true stored=true/ field name=resourcename type=text_general indexed=true stored=true/ field name=url type=text_general indexed=true stored=true/ field name=content_type type=string indexed=true stored=true multiValued=true/ field name=last_modified type=date indexed=true stored=true/ field name=links type=string indexed=true stored=true multiValued=true/ !-- Main body of document extracted by SolrCell. NOTE: This field is not indexed by default, since it is also copied to text using copyField below. This is to save space. Use this field for returning and highlighting document content. Use the text field to search the content. -- field name=content type=text_general indexed=false stored=true multiValued=true/ !-- catchall field, containing all other searchable text fields (implemented via copyField further on in this schema -- !-- field name=text type=text_general indexed=true stored=false multiValued=true/ -- !-- catchall text field that indexes tokens both normally and in reverse for efficient leading wildcard queries. -- field name=text_rev type=text_general_rev indexed=true stored=false multiValued=true/ !-- non-tokenized version of manufacturer to make it easier to sort or group results by manufacturer. 
copied from manu via copyField -- field name=manu_exact type=string indexed=true stored=false/ field name=payloads type=payloads indexed=true stored=true/ field name=_version_ type=long indexed=true stored=true/ I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas? Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at
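For readers who hit the same "multiple values encountered for non multiValued copy field text" error: it usually means several source fields are being copied into a single-valued destination field. A hedged schema.xml sketch of one fix (field names follow the example schema; which fields you copy is your choice):

```xml
<!-- Several SolrCell fields (content, title, content_type, ...) are copied
     into the catch-all "text" field, so it must accept multiple values. -->
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="content" dest="text"/>
<copyField source="title" dest="text"/>
```

Alternatively, remove the extra copyField directives so that only one value ever reaches a single-valued field.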
uniqueKey required false for multivalued id when indexing rich documents
I am new to Solr and trying to index rich files. I have defined this in my schema: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/> and there is a line in my schema: <uniqueKey>id</uniqueKey> Should I make it like this: <uniqueKey required="false">id</uniqueKey> for my purpose?
How to define a generic field to hold all undefined fields
I sent some documents to my Solr to be indexed. However, I get this kind of error: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However, there may be many fields like that. How can I define a generic field that holds all non-defined values, or maybe how can I ignore them?
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.comwrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
On 26 April 2013 18:15, Furkan KAMACI furkankam...@gmail.com wrote: Could anybody help me for my error. When I try to post documents with post.jar I get that error: [...] I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf The URL should be http://localhost:8983/solr/update . You have an extra /extract . Actually, if you are running from embedded Jetty, you should be able to skip the -Durl argument. Regards, Gora
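The doubled path in the 404 URL ( .../update/extract/extract?... ) is consistent with post.jar's auto mode appending the extracting handler's path itself for rich documents. A minimal Python sketch of that behavior (the helper is illustrative, not post.jar's actual code):

```python
def resolve_auto_url(base_url, file_name):
    """Mimic SimplePostTool's auto mode: for rich documents it targets the
    extracting handler by appending /extract to the update URL it was given.
    (Simplified sketch; the real tool handles many more file types.)"""
    rich = (".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx")
    if file_name.lower().endswith(rich):
        return base_url.rstrip("/") + "/extract"
    return base_url

# Passing the extracting handler's URL directly doubles the path -> 404:
print(resolve_auto_url("http://localhost:8983/solr/update/extract", "523387.pdf"))
# Passing the plain update URL (or omitting -Durl) works:
print(resolve_auto_url("http://localhost:8983/solr/update", "523387.pdf"))
```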
Re: uniqueKey required false for multivalued id when indexing rich documents
On 26 April 2013 18:38, Furkan KAMACI furkankam...@gmail.com wrote: I am new to Solr and try to index rich files. I have defined that at my schema: [...] uniqueKey required=false/uniqueKey This will not work: Please see http://wiki.apache.org/solr/UniqueKey for different use cases for the uniqueKey. For documents, I usually use the document name, or some segment of the filesystem path as the uniqueKey as that is automatically guaranteed to be unique. Regards, Gora
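Gora's suggestion (derive the uniqueKey value from the document's name or a segment of its filesystem path) can be sketched as follows; the helper and its flag are illustrative, not part of Solr or post.jar:

```python
import os.path
from urllib.parse import urlencode

def literal_id_param(path, use_basename=False):
    """Build the literal.id parameter for a rich document from its path.

    The full path is unique per machine; the base name is shorter but only
    safe if file names never collide across directories.
    """
    doc_id = os.path.basename(path) if use_basename else os.path.normpath(path)
    return urlencode({"literal.id": doc_id})

print(literal_id_param("/solr/example/exampledocs/523387.pdf"))
print(literal_id_param("/solr/example/exampledocs/523387.pdf", use_basename=True))
```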
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
Ok, solved 2013/4/26 Raymond Wiker rwi...@gmail.com On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.com wrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
I did not specify a URL and that solved it, as you mention, because the default URL does not include /extract. 2013/4/26 Furkan KAMACI furkankam...@gmail.com Ok, solved 2013/4/26 Raymond Wiker rwi...@gmail.com On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.com wrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: How to define a generic field to hold all undefined fields
A dynamic field with the name pattern *, a type of string, and stored=true, indexed=true and multiValued=true should be good enough for a generic field. Generally, only use this in test/experiment/development. It's not recommended as an approach for production apps. There is a commented-out * pattern in the example schema, but it ignores all incoming data rather than indexing and storing it. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 9:16 AM To: solr-user@lucene.apache.org Subject: How to define a generic field to hold all undefined fields I sent some documents to my Solr to be indexed. However, I get this kind of error: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However, there may be many fields like that. How can I define a generic field that holds all non-defined values, or maybe how can I ignore them?
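A hedged schema.xml sketch of the two options (patterns follow the example schema's conventions; use one or the other, since only a single * dynamic field can exist):

```xml
<!-- Option 1: catch-all for otherwise-undefined fields (test/dev only). -->
<dynamicField name="*" type="string" indexed="true" stored="true"
              multiValued="true"/>

<!-- Option 2: silently discard unknown fields instead, as in the example
     schema's commented-out pattern (assumes an "ignored" fieldtype with
     indexed="false" stored="false" is defined). -->
<dynamicField name="*" type="ignored" multiValued="true"/>
```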
Re: SOLR Install
Hi Peri, I think that document means you can deploy your own web app and Solr in one container like Tomcat, but with different context paths. If you want to bring Solr into your project, you just need to add some Maven dependencies like:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>${solr.version}</version>
</dependency>

This is exactly what I do. Then you need to prepare a 'solr.home' dir in your project, and add some configuration in web.xml for the filters and servlets Solr needs. I copied those configurations and some admin pages from solr.war. I hope this helps, best regards! On 2013-4-25, at 12:58 AM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I'm trying to use Solr as part of another Maven-based web application. I'm not sure how to wire the two war files. Any help please? I found this documentation in Solr but am unsure how to go about it:

<!-- If you are wiring Solr into a larger web application which controls the web context root, you will probably want to mount Solr under a path prefix (app.war with /app/solr mounted into it, for example). You will need to put this prefix in front of the SolrDispatchFilter url-pattern mapping too (/solr/*), and also on any paths for legacy Solr servlet mappings you may be using. For the Admin UI to work properly in a path-prefixed configuration, the admin folder containing the resources needs to be under the app context root named to match the path-prefix. For example: .war xxx js main.js -->
<!--
<init-param>
  <param-name>path-prefix</param-name>
  <param-value>/xxx</param-value>
</init-param>
-->

Thank you, Peri Subrahmanya On 4/24/13 12:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: solrservice.php and the text of that error both sound like parts of Typo3... they're definitely not part of Solr. You should ask on a list devoted to Typo3 to figure out what to do in this situation.
It likely won't involve reconfiguring Solr. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn¹t a Game On Wed, Apr 24, 2013 at 11:53 AM, vishal gupta vishalgup...@yahoo.co.in wrote: Hi i am using Solr 4.2.0 and extension 2.8.2 with Typo3. Whever I try to do indexing pages and news pages It gets only 3.29% indexed. I checked a developer log and found error in solrservice.php. And in solr admin it is giving Dups is not defined please add it. What should i do in this case? If possible please send me the settings of schema.xml and solrconfig.xml .i am new to typo3 and solr both. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indeing-Partially-working-tp40586 23.html Sent from the Solr - User mailing list archive at Nabble.com. *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global Services to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose.
Re: Log Monitor System for SolrCloud and Logging to log4j at SolrCloud?
Slf4j is meant to work with existing frameworks - you can set it up to work with log4j, and Solr will use log4j by default in the about to be released 4.3. http://wiki.apache.org/solr/SolrLogging - Mark On Apr 26, 2013, at 7:19 AM, Furkan KAMACI furkankam...@gmail.com wrote: I want to use GrayLog2 to monitor my logging files for SolrCloud. However I think that GrayLog2 works with log4j and logback. Solr uses slf4j. How can I solve this problem and what logging monitoring system does folks use?
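As a minimal sketch (file paths and sizes are illustrative), a log4j.properties that writes Solr's slf4j/log4j output to a rolling file which a GrayLog2 or log-shipping agent can then pick up:

```properties
# log4j.properties: route Solr's slf4j output (bound to log4j) to a file
# that a log shipper / GrayLog2 agent can tail.
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2} - %m%n
```

A dedicated GELF appender for log4j also exists if you want to ship directly to GrayLog2, but that is a third-party add-on, not part of Solr.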
AutoSuggest+Grouping in one request
Hi everyone, Search dropdowns on popular sites like Amazon (example imagehttp://i.imgur.com/aQyM8WD.jpg) use autosuggested words along with grouping (Field Collapsing in Solr). While I can replicate the same functionality in Solr using two requests (first to obtain suggestions, second for the actual query using the most probable suggestion), I want to know if this can be done in one request itself. I understand that there are various ways to obtain suggestions (term component, facets, Solr's inbuilt Suggesterhttp://wiki.apache.org/solr/Suggester), and I'm open to using any one of them, if it means I'll be able to get everything (groups + suggestions) in one request. Looking forward to some advice with regard to this. Thanks, Rounak
RE: Using another way instead of DIH
yes, I misspoke. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: xiaoqi [mailto:belivexia...@gmail.com] Sent: Thursday, April 25, 2013 8:37 PM To: solr-user@lucene.apache.org Subject: RE: Using another way instead of DIH Thanks for help . data-config.xml ? i can not find this file , u mean data-import.xml or solrconfig.xml ? -- View this message in context: http://lucene.472066.n3.nabble.com/Using-another-way-instead-of-DIH-tp4058937p4059067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR Install
If you unpack the solr.war file, you'll find some configuration in web.xml like:

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
<servlet>
  <servlet-name>Zookeeper</servlet-name>
  <servlet-class>org.apache.solr.servlet.ZookeeperInfoServlet</servlet-class>
</servlet>
<servlet>
  <servlet-name>LoadAdminUI</servlet-name>
  <servlet-class>org.apache.solr.servlet.LoadAdminUiServlet</servlet-class>
</servlet>

and so on. These configurations tell your application how to dispatch requests to Solr. Note that the SolrRequestFilter in solr.war's web.xml is mapped to the URL pattern /*. If you want to make a sub-context for Solr, it should be something like /solr/*, and then you need to put the web resources (admin.html, css, img, js, tpl from solr.war) in the *same* directory of your web app's WebRoot folder. For example, if you map SolrRequestFilter to the URL pattern /solr/*, your WebRoot dir looks like:

WebRoot
|--- solr
     |--- admin.html
     |--- css
     |--- ......

This is what the comment in solr.war's web.xml says, and I think it is also what confused you in the original email thread. In my web.xml, I just copied the whole Solr content, pasted it into mine, and edited some URL mappings. On 2013-4-26, at 10:04 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: Jundan, I got all the setup done correctly, i.e. I got the Maven dependencies, used a Maven overlay to copy all the Solr files to the WEB-INF directory, and also specified solr.home. The issue is that when I try to access any of the Solr URLs like /admin.html or /dataimport, nothing seems to happen. So I'm not sure how to correctly configure the web.xml; would it be possible to share your web.xml please?
Thank you, Peri Subrahmanya HTC Global Services (Development Manager for System Integration on Kuali OLE project @ Indiana University, Bloomington, USA) Cell: (+1) 618.407.3521 Skype/Gtalk: peri.subrahmanya On 4/26/13 9:51 AM, jnduan jnd...@gmail.com wrote: Hi Peri, I think that document means you can deploy your own web app and Solr in one container like Tomcat, but with different context paths. If you want to bring Solr into your project, you just need to add some Maven dependencies like:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>${solr.version}</version>
</dependency>

This is exactly what I do. Then you need to prepare a 'solr.home' dir in your project, and write some configuration in web.xml for the filters and servlets Solr needs. I copied those configurations and some admin pages from solr.war. I hope this helps, best regards! On 2013-4-25, at 12:58 AM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I'm trying to use Solr as part of another Maven-based web application. I'm not sure how to wire the two war files. Any help please? I found this documentation in Solr but am unsure how to go about it: <!-- If you are wiring Solr into a larger web application which controls the web context root, you will probably want to mount Solr under a path prefix (app.war with /app/solr mounted into it, for example).
You will need to put this prefix in front of the SolrDispatchFilter url-pattern mapping too (/solr/*), and also on any paths for legacy Solr servlet mappings you may be using. For the Admin UI to work properly in a path-prefixed configuration, the admin folder containing the resources needs to be under the app context root named to match the path-prefix. For example:

.war
   xxx
      js
         main.js
-->
<!--
<init-param>
  <param-name>path-prefix</param-name>
  <param-value>/xxx</param-value>
</init-param>
-->

Thank you, Peri Subrahmanya On 4/24/13 12:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: solrservice.php and the text of that error both sound like parts of Typo3... they're definitely not part of Solr. You should ask on a list devoted to Typo3 to figure out what to do in this situation. It
RE: Using another way instead of DIH
Here are some things I would try: 1. Make sure the parent entity is only returning 1 row per Solr document. If not, move the problem joins to their own queries in child entities. 2. For the child entities, use caching. This prevents the n+1 select problem. The changes are: remove the pk attribute (only the parent entity needs this, and only to support delta updates); remove the where clause from the query; add cacheKey/cacheLookup to each child like this: cacheKey='id' cacheLookup='item.shop_id'; add cacheImpl='SortedMapBackedCache' to each child. This will cache in-memory. 3. If caching uses too much memory, see https://issues.apache.org/jira/browse/SOLR-2613 and https://issues.apache.org/jira/browse/SOLR-2948. These are disk-backed cache implementations that you can use as alternatives to SortedMapBackedCache. Or you can write your own. 4. If it is still too slow, you can parallelize it by splitting the data into partitions and running multiple DIH handlers at once. This is a somewhat complex solution but still might be easier than writing a multi-threaded import program yourself. One way to partition SQL data like this is to add a where clause like: where mod(id, 4)=${dataimporter.request.partitionNumber} I will mention that I recently converted one of our applications to use its own SolrJ-based code to update instead of DIH. We were using BerkleyBackedCache from SOLR-2613 to handle the child entities, and it worked well. But the app dev team wanted something that was part of their codebase that they could maintain more easily, so we migrated off of DIH. We do updates more frequently and batch the updates so everything fits in-memory. Doing it this way, the SolrJ code was very straightforward and quick to write.
James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: xiaoqi [mailto:belivexia...@gmail.com] Sent: Friday, April 26, 2013 5:10 AM To: solr-user@lucene.apache.org Subject: Re: Using another way instead of DIH below is my data-import.xml, any suggestion?

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://*:3306/guang" user="guang" password="guang"/>
  <document name="products">
    <entity name="item" pk="id"
            query="SELECT a.*,d.* FROM item a LEFT JOIN item_ctr_by_area d ON a.id=d.item_id LEFT JOIN shop b ON a.shop_id = b.id WHERE a.status =1 AND b.status = 1 AND b.uctrac_status =0 AND uctrac_adgroup_id IS NOT NULL">
      <field column="id" name="item_id"/>
      <field column="title" name="item_title"/>
      <field column="description" name="item_description"/>
      <field column="price" name="item_price"/>
      <field column="promotion" name="item_promotion"/>
      <field column="pic_url" name="item_picurl"/>
      <field column="local_pic_url" name="item_local_picurl"/>
      <field column="detail_url" name="item_detailurl"/>
      <field column="recommend_value" name="item_recommend_value"/>
      <field column="uctrac_adgroup_id" name="uctrac_adgroup_id"/>
      <field column="uctrac_price" name="uctrac_adgroup_price"/>
      <field column="uctrac_status" name="uctrac_adgroup_status"/>
      <field column="uctrac_creative_id" name="uctrac_creative_id"/>
      <field column="lctr" name="item_lctr"/>
      <field column="CTR_ALL" name="region_ctr_all"/>
      <field column="CTR_N" name="region_ctr_n"/>
      <field column="CTR_MN" name="region_ctr_mn"/>
      <field column="CTR_MS" name="region_ctr_ms"/>
      <field column="CTR_S" name="region_ctr_s"/>
      <field column="CTR_011100" name="region_ctr_0111"/>
      <field column="CTR_011300" name="region_ctr_0113"/>
      <field column="CTR_012100" name="region_ctr_0121"/>
      <field column="CTR_013100" name="region_ctr_0131"/>
      <field column="CTR_013200" name="region_ctr_0132"/>
      <field column="CTR_013300" name="region_ctr_0133"/>
      <field column="CTR_013400" name="region_ctr_0134"/>
      <field column="CTR_013500" name="region_ctr_0135"/>
      <field column="CTR_013700" name="region_ctr_0137"/>
      <field column="CTR_014100" name="region_ctr_0141"/>
      <field column="CTR_014200" name="region_ctr_0142"/>
      <field column="CTR_014300" name="region_ctr_0143"/>
      <field column="CTR_014400" name="region_ctr_0144"/>
      <field column="CTR_015100" name="region_ctr_0151"/>
      <field column="CTR_016100" name="region_ctr_0161"/>
      <field column="CTR_ALL_2" name="region_ctr_all_2"/>
      <field column="CTR_N_2" name="region_ctr_n_2"/>
      <field column="CTR_MN_2" name="region_ctr_mn_2"/>
      <field column="CTR_MS_2" name="region_ctr_ms_2"/>
      <field column="CTR_S_2" name="region_ctr_s_2"/>
      <field column="CTR_011100_2" name="region_ctr_0111_2"/>
      <field column="CTR_011300_2" name="region_ctr_0113_2"/>
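A minimal sketch of James's cached-child-entity suggestion applied to a config like the one above (only a couple of fields shown; the cacheKey/cacheLookup column names are my reading of the table layout in the quoted config, not something James specified):

```xml
<entity name="item" pk="id" query="SELECT a.* FROM item a WHERE a.status = 1">
  <field column="id" name="item_id"/>
  <field column="title" name="item_title"/>
  <!-- cached child: no pk attribute, no WHERE clause; the rows are fetched
       once and joined in memory via cacheKey/cacheLookup, instead of running
       one query per parent row (the n+1 select problem) -->
  <entity name="ctr" query="SELECT * FROM item_ctr_by_area"
          cacheKey="item_id" cacheLookup="item.id"
          cacheImpl="SortedMapBackedCache">
    <field column="CTR_ALL" name="region_ctr_all"/>
  </entity>
</entity>
```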
SolrDocument getFieldNames() exclude dynamic fields?
Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
Re: Solr Indexing Rich Documents
Hi Furkan, post.jar is meant to be used as an example, quick start, etc. For production (incremental updates, deletes) consider using http://manifoldcf.apache.org for indexing rich documents. It utilises the ExtractingRequestHandler feature of Solr. --- On Fri, 4/26/13, Furkan KAMACI furkankam...@gmail.com wrote: From: Furkan KAMACI furkankam...@gmail.com Subject: Re: Solr Indexing Rich Documents To: solr-user@lucene.apache.org Date: Friday, April 26, 2013, 3:39 PM Thanks for the answer, I get an error now: a FileNotFound exception, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I watch out for? Also, how do you send such documents to Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Re: SolrDocument getFieldNames() exclude dynamic fields?
Apologies, I wasn't storing these dynamic fields. On Fri, Apr 26, 2013 at 11:01 AM, Luis Lebolo luis.leb...@gmail.com wrote: Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
excluding something from copyfield source?
Hi; I use this: <copyField source="*" dest="text"/> However, I want to exclude something, i.e. the author field. How can I do that?
Re: Solr Indexing Rich Documents
Is there any example at the wiki for ManifoldCF? 2013/4/26 Ahmet Arslan iori...@yahoo.com Hi Furkan, post.jar is meant to be used as an example, quick start, etc. For production (incremental updates, deletes) consider using http://manifoldcf.apache.org for indexing rich documents. It utilises the ExtractingRequestHandler feature of Solr. --- On Fri, 4/26/13, Furkan KAMACI furkankam...@gmail.com wrote: From: Furkan KAMACI furkankam...@gmail.com Subject: Re: Solr Indexing Rich Documents To: solr-user@lucene.apache.org Date: Friday, April 26, 2013, 3:39 PM Thanks for the answer, I get an error now: a FileNotFound exception, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I watch out for? Also, how do you send such documents to Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Exclude Pattern at Dynamic Field
I use this in my Solr 4.2.1: <dynamicField name="*" type="ignored" multiValued="true"/> However, can I exclude some patterns from it?
Re: Using another way instead of DIH
On 4/25/2013 9:00 AM, xiaoqi wrote: i using DIH to build index is slow , when it fetch 2 million rows , it will spend 20 minutes , very slow. If it takes 20 minutes for two million records, I'd say it's working very well. I do six simultaneous MySQL imports of 13 million records each. It takes a little over 3 hours on Solr 3.5.0, a little over four hours on Solr 4.2.1 (due to compression and the transaction log). If I do them one at a time instead of all at once, it will go *slightly* faster for each one, but the overall process would take a whole day. For comparison purposes, that's about 20 minutes each time it does 1 million rows. Yours is going twice as fast as mine. Looking at your config file, I don't see a batchSize parameter. This is a change that is specific to MySQL. You can greatly reduce the memory usage by including this attribute in the dataSource tag along with the user and password: batchSize=-1 With two million records and no batchSize parameter, I'm surprised you aren't hitting an Out Of Memory error. By default JDBC will pull down all the results and store them in memory, then DIH will begin indexing. A batchSize of -1 makes DIH tell the MySQL JDBC driver to stream the results instead of storing them. Reducing the memory usage in this way might make it go faster. Thanks, Shawn
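Concretely, Shawn's batchSize suggestion is a one-attribute change to the dataSource element from the quoted config (the host is a placeholder, as in the original):

```xml
<!-- batchSize="-1" makes the MySQL JDBC driver stream rows instead of
     buffering the entire result set in memory before DIH starts indexing -->
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://*:3306/guang"
            user="guang" password="guang"
            batchSize="-1"/>
```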
DataImportHandler - Indexing xml content
I have a column in my database that is of type long text and holds xml content. I was wondering, when I define the entity, is there a way to provide a custom extractor that takes in the xml and returns rows with appropriate fields to be indexed. Thank you, Peri Subrahmanya On 4/26/13 12:24 PM, Shawn Heisey s...@elyograg.org wrote: On 4/25/2013 9:00 AM, xiaoqi wrote: i using DIH to build index is slow , when it fetch 2 million rows , it will spend 20 minutes , very slow. If it takes 20 minutes for two million records, I'd say it's working very well. I do six simultaneous MySQL imports of 13 million records each. It takes a little over 3 hours on Solr 3.5.0, a little over four hours on Solr 4.2.1 (due to compression and the transaction log). If I do them one at a time instead of all at once, it will go *slightly* faster for each one, but the overall process would take a whole day. For comparison purposes, that's about 20 minutes each time it does 1 million rows. Yours is going twice as fast as mine. Looking at your config file, I don't see a batchSize parameter. This is a change that is specific to MySQL. You can greatly reduce the memory usage by including this attribute in the dataSource tag along with the user and password: batchSize=-1 With two million records and no batchSize parameter, I'm surprised you aren't hitting an Out Of Memory error. By default JDBC will pull down all the results and store them in memory, then DIH will begin indexing. A batchSize of -1 makes DIH tell the MySQL JDBC driver to stream the results instead of storing them. Reducing the memory usage in this way might make it go faster. Thanks, Shawn
Customizing Solr GUI
Hi, I want to customize the Solr GUI, and I learnt that the most popular options are 1. Velocity - which is integrated with Solr; the format and options can be customized 2. Project Blacklight Pros and cons? Secondly, I read that one can delete data just by running a delete query in the URL. Does either Velocity or Blacklight provide a way to disable this, or provide any kind of security or access control - so that users can only browse/search and admins can view the admin screen? How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: excluding something from copyfield source?
On 26 April 2013 20:51, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I use this: <copyField source="*" dest="text"/> However, I want to exclude something, i.e. the author field. How can I do that? Instead of using *, use separate copyField directives for the fields that you want copied. You can also use more restrictive globs, e.g., <copyField source="*_txt" dest="text"/> Regards, Gora
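A sketch of the explicit alternative Gora describes (the field names other than author are placeholders, not from the original schema):

```xml
<!-- enumerate the fields to copy instead of using a catch-all glob;
     author is excluded simply by leaving it out of the list -->
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="keywords" dest="text"/>
```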
Re: Customizing Solr GUI
Generally, your UI web pages should communicate with your own application layer, which in turn communicates with Solr, but you should try to avoid having Solr itself visible to the outside world. -- Jack Krupansky -Original Message- From: kneerosh Sent: Friday, April 26, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Customizing Solr GUI Hi, I want to customize Solr gui, and I learnt that the most popular options are 1. Velocity- which is integrated with Solr. The format and options can be customized 2. Project Blacklight Pros and cons? Secondly I read that one can delete data by just running a delete query in the URL. Does either velocity or blacklight provide a way to disable this, or provide any kind of security or access control- so that users can only browse/search and admins can view the admin screen. How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DataImportHandler - Indexing xml content
Have you looked at: http://wiki.apache.org/solr/DataImportHandler#FieldReaderDataSource ? Regards, Alex. On Fri, Apr 26, 2013 at 12:29 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I have a column in my database that is of type long text and holds xml content. I was wondering when I define the entity record is there a way to provide a custom extractor that will take in the xml and return rows with appropriate fields to be indexed. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Customizing Solr GUI
So, building on this: 1) Velocity is an option for internal admin interface because it is collocated with Solr and therefore does not 'hide' it 2) Blacklight is the (Rails-based) application layer and the Solr is internal behind it, so it does provide the security. Hope this helps to understand the distinction. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Apr 26, 2013 at 12:57 PM, Jack Krupansky j...@basetechnology.com wrote: Generally, your UI web pages should communicate with your own application layer, which in turn communicates with Solr, but you should try to avoid having Solr itself visible to the outside world. -- Jack Krupansky -Original Message- From: kneerosh Sent: Friday, April 26, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Customizing Solr GUI Hi, I want to customize Solr gui, and I learnt that the most popular options are 1. Velocity- which is integrated with Solr. The format and options can be customized 2. Project Blacklight Pros and cons? Secondly I read that one can delete data by just running a delete query in the URL. Does either velocity or blacklight provide a way to disable this, or provide any kind of security or access control- so that users can only browse/search and admins can view the admin screen. How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
relevance when merging results
Hi, I'm currently using Solr 4.0 final on Tomcat v7.0.3x. I have 2 cores (let's call them A and B) and I need to combine them as one for the UI. However, we're having trouble deciding how best to merge these two result sets. Currently, I'm using relevancy to do the merge. For example, I search for "red" in both cores. Core A has a max score of .919856 with 87 results. Core B has a max score of .6532563 with 30 results. I would like to simply merge numerically, but I don't know if that's valid. If I merge in numerical order, then Core B results won't appear until element 25 or later. I initially thought about just taking the top 5 results from each and layering one on top of the other. Is there a best practice out there for merging relevancy? Please advise... Thanks, -- View this message in context: http://lucene.472066.n3.nabble.com/relevance-when-merging-results-tp4059275.html Sent from the Solr - User mailing list archive at Nabble.com.
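One common heuristic (not a Solr feature, just an application-side sketch) is to normalize each core's scores by that core's own max score before interleaving. Raw Lucene scores from different cores and queries are not directly comparable, so treat this as a rough ordering rather than true cross-core relevance; the class and field names below are illustrative:

```java
import java.util.*;

public class MergeResults {
    // Minimal result holder; in SolrJ the score would come from the
    // "score" pseudo-field of each SolrDocument.
    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Divide every score in the list by that list's maximum score,
    // so the top hit of each core ends up at 1.0.
    static List<Hit> normalize(List<Hit> hits) {
        double max = 0.0;
        for (Hit h : hits) max = Math.max(max, h.score);
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) out.add(new Hit(h.id, max > 0 ? h.score / max : 0.0));
        return out;
    }

    // Normalize both lists, concatenate, and sort by normalized score descending.
    static List<Hit> merge(List<Hit> coreA, List<Hit> coreB) {
        List<Hit> merged = new ArrayList<>(normalize(coreA));
        merged.addAll(normalize(coreB));
        merged.sort((x, y) -> Double.compare(y.score, x.score));
        return merged;
    }

    public static void main(String[] args) {
        // Max scores taken from the example in the thread
        List<Hit> a = Arrays.asList(new Hit("a1", 0.919856), new Hit("a2", 0.45));
        List<Hit> b = Arrays.asList(new Hit("b1", 0.6532563), new Hit("b2", 0.30));
        for (Hit h : merge(a, b)) System.out.println(h.id + " " + h.score);
    }
}
```

With this scheme Core B's top hit surfaces immediately alongside Core A's, instead of appearing only after two dozen Core A results.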
Re: Exclude Pattern at Dynamic Field
No, other than being explicit about individual patterns, which is better anyway. Generally, * is a crutch or experimental tool ("Let's just see what all the data and metadata is and then decide what to keep"). It is better to use explicit patterns or a static schema for production use. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 11:29 AM To: solr-user@lucene.apache.org Subject: Exclude Pattern at Dynamic Field I use this in my Solr 4.2.1: <dynamicField name="*" type="ignored" multiValued="true"/> However, can I exclude some patterns from it?
Re: Prons an Cons of Startup Lazy a Handler?
: In short, whether you want to keep the handler is completely independent of : the lazy startup option. I think Jack misread your question -- my interpretation is that you are asking about the pros/cons of removing 'startup=lazy' ... : <requestHandler name="/update/extract" : startup="lazy" class="solr.extraction.ExtractingRequestHandler"> : : it starts up lazily. So what are the pros and cons of removing it in my : situation? ...if you know you will definitely be using this handler, then you should probably remove startup="lazy" -- the advantage of lazy request handlers is that there is no init cost for having them in your config if you never use them, making them handy for the example configs that many people copy and re-use without modifying, so that they don't pay any price for having features declared that they don't use. -Hoss
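Concretely, the eager version is just the same handler declaration with the startup attribute dropped (class name as quoted in the question above):

```xml
<!-- initialized at core load instead of on first request -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"/>
```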
Re: Weird query issues
Hello Shawn, We found that it is unrelated to the group queries and instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SolrJ? I can generate a similar query via curl hitting the select handler, like http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerThread-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things: 1. Even if I query from the admin console, irrespective of the query, I get the same results as if it's a cached page of a *:* query, i.e. I cannot see the query I entered in the server log; the query doesn't even reach the server, but I get the same results as *:*. 2. If I query via SolrJ, no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Thanks Ravi Kiran Bhaskar On Sat, Apr 20, 2013 at 10:33 PM, Ravi Solr ravis...@gmail.com wrote: Thanks for your advice Shawn. I have created a JIRA issue, SOLR-4743. On Sat, Apr 20, 2013 at 4:32 PM, Shawn Heisey s...@elyograg.org wrote: On 4/20/2013 9:08 AM, Ravi Solr wrote: Thank you very much for responding Shawn. I never use IE, I use Firefox. These are brand new servers and I don't think I am mixing versions. What made you think I was using 1.4.1? You are correct in saying that the server is throwing an HTML response, since a group query has been failing with a SEVERE error, following which the entire instance behaves weirdly until we restart. It's surprising that group query error handling has such a glaring issue.
If you specify group=true but don't specify group.query or group.field, Solr throws a SEVERE exception, following which we see the empty queries and finally no responses via SolrJ, and the admin console gives numFound always equal to the total number of docs in the index. It looks like the searcher goes for a spin once it encounters the exception. Such a situation should have been gracefully handled. Ah, so what's happening is that after an invalid grouping query, Solr is unstable and stops working right. You should file an issue in Jira, giving as much detail as you can. My last message was almost completely wrong. You are right that it should be gracefully handled, and obviously it is not. For the 3.x Solr versions, grouping did not exist before 3.6. It is a major 4.x feature that was backported. Sometimes such major features depend on significant changes that have not happened on older versions, leading to problems like this. Unfortunately, you could wait quite a while for a fix on 3.6, where active development has stopped. I have no personal experience with grouping, but I just tried the problematic query (adding group=true to one that works) on 4.2.1. It doesn't throw an error, I just get no results. When I follow it with a regular query, everything works perfectly. Would you be able to upgrade to 4.2.1? That's not a trivial thing to do, so hopefully you are already working on upgrading. Thanks, Shawn
Re: Need to log query request before it is processed
Solved this using a custom SearchHandler and some Log4J goodness. Posting here in case anyone has need for logging query requests before they are executed, which in my case is useful for tracking any queries that cause OOMs. My solution uses Log4J's NDC support to log each query request before it is processed ... the trick was that the SolrCore.execute method logs at the very end, so I wasn't able to push and pop the NDC from first- and last- SearchComponents respectively. In other words, SolrCore logs the query after all the search components complete, so I couldn't pop the NDC stack in a last-component. Consequently, I created a simple extension to SearchHandler that relies on the SolrRequestInfo close hook to pop the NDC:

public class NDCLoggingSearchHandler extends SearchHandler implements Closeable {
    private static final Logger log = Logger.getLogger(NDCLoggingSearchHandler.class);
    private static final AtomicInteger ndc = new AtomicInteger(0);

    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        SolrRequestInfo.getRequestInfo().addCloseHook(this);
        NDC.push("Q:" + ndc.incrementAndGet());
        log.info(req.getParamString());
        super.handleRequest(req, rsp);
    }

    public void close() throws IOException {
        NDC.remove();
    }
}

Now I get nice logging like:

2013-04-26 19:07:52,545 [qtp1480462011-13] INFO analytics.solr.NDCLoggingSearchHandler Q:20 - indent=true&q=*:*&wt=xml
2013-04-26 19:07:52,717 [qtp1480462011-13] INFO solr.core.SolrCore Q:20 - [solr_signal] webapp=/solr path=/select params={indent=true&q=*:*&wt=xml} hits=25389931 status=0 QTime=172

The Q:20 part is the NDC. Cheers, Tim PS - I am so happy that Mark switched things to Log4J for 4.3 - https://issues.apache.org/jira/browse/SOLR-3706 +1x10 On Thu, Apr 25, 2013 at 5:44 PM, Sudhakar Maddineni maddineni...@gmail.com wrote: HI Tim, Have you tried enabling the logging levels on httpclient, which is used by the solrj classes internally? Thx, Sudhakar. On Thu, Apr 25, 2013 at 10:12 AM, Timothy Potter thelabd...@gmail.com wrote: I would like to log query requests before they are processed. Currently, it seems they are only logged after being processed. I've tried enabling a finer logging level but that didn't seem to help. I've enabled request logging in Jetty but most queries come in as POSTs from SolrJ. I was thinking of adding a query request logger as a first-component but wanted to see what others have done for this? Thanks. Tim
On Thu, Apr 25, 2013 at 10:12 AM, Timothy Potter thelabd...@gmail.comwrote: I would like to log query requests before they are processed. Currently, it seems they are only logged after being processed. I've tried enabling a finer logging level but that didn't seem to help. I've enabled request logging in Jetty but most queries come in as POSTs from SolrJ I was thinking of adding a query request logger as a first-component but wanted to see what others have done for this? Thanks. Tim
Re: Need to log query request before it is processed
I see. Thanks for sharing. -Sudhakar. On Friday, April 26, 2013, Timothy Potter wrote: Solved this using a custom SearchHandler and some Log4J goodness. ...
Not In query
Hi all. We have an index with 300.000 documents and a lot, a lot of fields. We're planning a module where users will choose some documents to exclude from their search results. So, these documents will be excluded for UserA and visible for UserB. We have some options to do this. The simplest way is to do a Not In query on document id, but we don't know the performance impact this will have. Is this an option? Is there another reasonable way to accomplish this? Thanks * -- * *And you shall know the truth, and the truth shall set you free. (John 8:32)* *andre.maldonado*@gmail.com (11) 9112-4227 http://www.orkut.com.br/Main#Profile?uid=2397703412199036664 http://www.facebook.com/profile.php?id=10659376883 http://twitter.com/andremaldonado http://www.delicious.com/andre.maldonado https://profiles.google.com/105605760943701739931 http://www.linkedin.com/pub/andr%C3%A9-maldonado/23/234/4b3 http://www.youtube.com/andremaldonado
facet.offset issue (previously: [solr 3.4] anomaly during distributed facet query with 102 shards)
Hi list, We have encountered a weird bug related to the facet.offset parameter. In short: the more general the query is, generating lots of hits, the higher the risk of the facet.offset parameter to stop working. In more detail: 1. Since getting all the facets we need (facet.limit=1000) from around 100 shards didn't work for some broad query terms, like "the" (yes, we index and search those too), we decided to paginate. 2. The facet page size is set to 100 for all pages starting from the second one. We start with facet.offset=0&facet.limit=30, then continue with facet.offset=30&facet.limit=100, then facet.offset=100&facet.limit=100 and so on, until we get to facet.offset=900. All facets work just fine until we hit facet.offset=700. Debugging showed that in the class HttpCommComponent a static Executor instance is created with a setting to terminate idle threads after 5 sec. Our belief is that this setting is way too low for our billion-document scenario and broad searches. Setting this to 5 min seems to improve the situation a bit, but not solve it fully. This same class is no longer used in 4.2.1 (can anyone tell what's used instead in distributed faceting?) so it isn't easy to compare these parts of the code. Anyhow, playing now with this value in the hope of seeing some light in the tunnel (would be good if it is not the train). One more question: can this be related to RAM allocation on the router and/or shards? If RAM isn't enough for some operations, why wouldn't the router or shards just crash with OOM? If anyone has other ideas for what to try / look into, that'll be much appreciated. Dmitry
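For reference, the pagination scheme described above (a first page of 30 facets, then pages of 100 up to offset 900) can be generated by a small client-side helper. This is purely illustrative plain Java, mirroring the sequence in the message, not any Solr API:

```java
import java.util.ArrayList;
import java.util.List;

public class FacetPages {
    // The exact sequence from the message: offset 0 / limit 30, then
    // offset 30 / limit 100, then offsets 100..900 in steps of 100.
    static List<int[]> pages() {
        List<int[]> pages = new ArrayList<>();
        pages.add(new int[] {0, 30});
        pages.add(new int[] {30, 100});
        for (int offset = 100; offset <= 900; offset += 100) {
            pages.add(new int[] {offset, 100});
        }
        return pages;
    }

    public static void main(String[] args) {
        for (int[] p : pages()) {
            System.out.println("facet.offset=" + p[0] + "&facet.limit=" + p[1]);
        }
    }
}
```

Each pair would be appended to the facet request as facet.offset / facet.limit parameters.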
Re: Weird query issues
On 4/26/2013 1:01 PM, Ravi Solr wrote: Hello Shawn, We found that it is unrelated to the group queries instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SOLRJ ? I can generate similar query via curl hitting the select handler like - http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerTh read-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things 1. even if I query from admin console, irrespective of the query, I get same results as if its a cached page of *:* query. i.e. I cannot see the query I entered in the server log, the query doesn't even come to the server but I get same results as *:* 2. If I query via solrj no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Querying from the admin UI and not seeing anything in the server log sounds like browser caching. You can turn that off in solrconfig.xml. I could not duplicate what you're seeing with SolrJ. You didn't say what version of SolrJ, so I did this using 3.6.2 (same as your server version). I thought maybe if you had a query object that didn't have an actual query set, it might do what you're seeing, but that doesn't appear to be the case. I don't have a 3.6.2 server to test against, so I used my 3.5.0 and 4.2.1 servers. 
Test code: http://pastie.org/private/bnvurz1f9b9viawgqbxvmq Solr 4.2.1 log: INFO - 2013-04-26 14:17:24.127; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} hits=0 status=0 QTime=20 3.5.0 server log: Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Apr 26, 2013 2:20:23 PM org.apache.solr.core.SolrCore execute INFO: [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} status=500 QTime=0 Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Same code without the setParser line: Solr 4.2.1 log: INFO - 2013-04-26 14:14:01.270; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=javabin&version=2} hits=0 status=0 QTime=187 Thanks, Shawn
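For Shawn's browser-caching suggestion, the relevant solrconfig.xml knob is the httpCaching element inside requestDispatcher. A minimal fragment (a sketch, to be merged into your existing requestDispatcher section) looks like:

```
<requestDispatcher>
  <!-- never304="true" makes Solr stop sending cache validators
       (ETag / Last-Modified), so the browser stops serving a
       stale cached *:* response for admin-UI queries -->
  <httpCaching never304="true"/>
</requestDispatcher>
```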
Re: How to define a generic field to hold all undefined fields
I can highly recommend reading the documentation before asking questions :) You are using the ExtractingRequestHandler, which is documented on the wiki like most other stuff. The fastest way to search for Solr stuff is search-lucene.com: http://search-lucene.com/?q=extracting+request+handler&fc_project=Solr Reading that wiki page you'll notice the parameters uprefix and defaultField, which would both be ways to solve your problem. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 26 Apr 2013 at 15:16, Furkan KAMACI furkankam...@gmail.com wrote: I sent some documents to my Solr to be indexed. However I get these kinds of errors: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However there may be many fields like that. How can I define a generic field that holds all undefined values, or maybe how can I ignore them?
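To make the uprefix advice concrete, a sketch of the two pieces involved: the extract handler in solrconfig.xml prefixes any field not in the schema, and a catch-all dynamicField in schema.xml absorbs the prefixed names. The field names below follow the stock Solr example configs; verify against your own schema:

```
<!-- solrconfig.xml: unknown extracted fields get the ignored_ prefix -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<!-- schema.xml: silently swallow anything matching the prefix -->
<dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
```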
IOException when using Solr 4.2.1 for indexing
Hi All, I get the error below on trying to index using Solr 4.2.1. I have a single-core setup and use HttpSolrServer with DefaultHttpClient to talk to Solr. #Here is how HttpSolrServer is instantiated: solrServer = new HttpSolrServer( baseURL, configurator.createHttpClient( new BasicHttpParams( ) ) ); #DefaultHttpClient creation: public DefaultHttpClient createHttpClient( HttpParams parameters ) { DefaultHttpClient httpClient = new DefaultHttpClient( connectionManager, parameters ); httpClient.setRoutePlanner( routePlanner ); return httpClient; } Any ideas on what it is that I am doing incorrectly will be appreciated. Thanks! Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://example.com:8080/solr-server-4.2.1 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:416) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at com.qpidhealth.qpid.solr.SolrService.saveOrUpdate(SolrService.java:117) ~[classes/:na] ... 
84 common frames omitted Caused by: org.apache.http.client.ClientProtocolException: null at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] ... 89 common frames omitted Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request fail$ at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) ~[httpclient-4.2.2.jar:4.2.2] ... 
92 common frames omitted Caused by: java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) ~[na:1.7.0_05] at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) ~[na:1.7.0_05] at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[na:1.7.0_05] at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:147) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:167) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.2.2.jar:4.2.2]
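Reading the trace bottom-up: the root cause is a Broken pipe while streaming the request body; HttpClient then tried to retry, but the InputStreamEntity wrapping the update payload can only be read once, hence the NonRepeatableRequestException. One hedged workaround, independent of any Solr or HttpClient API, is to buffer the payload into memory before building the request, since an in-memory body can be re-sent on retry. A minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RepeatableBody {
    // Drain a one-shot stream into memory; a byte[] body can safely be
    // re-sent if HttpClient retries after a dropped connection.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = toByteArray(new ByteArrayInputStream("<add>...</add>".getBytes()));
        System.out.println(body.length);
    }
}
```

The trade-off is memory: this is only reasonable if individual update batches are small enough to hold in RAM.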
Re: Not In query
I would start with the way you propose, a negative filter: q=foo bar&fq=-id:(123 729 640 112...) This will effectively hide those doc ids, and a benefit is that it is cached, so if the list of ids is long you'll only take the performance hit the first time. I don't know your application, but if it is highly likely that a single user will add excludes for several thousand ids, then you should perhaps consider other options and benchmark up front. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 26 Apr 2013 at 21:50, André Maldonado andre.maldon...@gmail.com wrote: Hi all. We have an index with 300.000 documents and a lot, a lot of fields. We're planning a module where users will choose some documents to exclude from their search results. So, these documents will be excluded for UserA and visible for UserB. We have some options to do this. The simplest way is to do a Not In query on document id, but we don't know the performance impact this will have. Is this an option? Is there another reasonable way to accomplish this? Thanks * -- * *And you shall know the truth, and the truth shall set you free. (John 8:32)* *andre.maldonado*@gmail.com (11) 9112-4227 http://www.orkut.com.br/Main#Profile?uid=2397703412199036664 http://www.facebook.com/profile.php?id=10659376883 http://twitter.com/andremaldonado http://www.delicious.com/andre.maldonado https://profiles.google.com/105605760943701739931 http://www.linkedin.com/pub/andr%C3%A9-maldonado/23/234/4b3 http://www.youtube.com/andremaldonado
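The negative filter Jan describes can be assembled client-side. A small helper, purely illustrative (the field name "id" and the helper itself are assumptions, not any Solr API), that turns a list of excluded ids into the fq value:

```java
import java.util.List;

public class ExcludeFilter {
    // Build a negative filter query like "-id:(123 729 640)"; as an fq,
    // Solr caches the resulting doc set after the first request.
    static String excludeFq(String field, List<String> ids) {
        return "-" + field + ":(" + String.join(" ", ids) + ")";
    }

    public static void main(String[] args) {
        System.out.println(excludeFq("id", List.of("123", "729", "640")));
        // prints: -id:(123 729 640)
    }
}
```

In SolrJ this string would typically be passed via addFilterQuery on the query object, keeping the main q untouched.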
Re: How to get/set customized Solr data source properties?
: : I am working on a DataSource implementation. I want to get some customized : properties when the *DataSource.init* method is called. I tried to add the ... : dataConfig : dataSource type="com.my.company.datasource" : my="value" / My understanding from looking at other DataSources is that this should work. : But initProps.getProperty("my") == null. Can you show us the actual code that fails with the dataConfig you mentioned? -Hoss
Re: Solr index searcher to lucene index searcher
: used to call the lucene IndexSearcher . As the documents are collected in : TopDocs in Lucene , before that is passed back to Nutch , i used to look : into the top K matching documents , consult some external repository : and further score the Top K documents and reorder them in the TopDocs array : . These reordered TopDocs is passed to Nutch . All these reordering code : was implemented by Extending the lucene IndexSearcher class . 1) that's basically the same info you provided before -- it still doesn't really tell us anything about what your current logic does with the top K documents and how/why/when you decide to reorder them or by how much -- details that are kind of important in being able to provide you with any meaningful advice on how to achieve your goal using the plugin hooks available in Solr. 2) if you only care about re-ordering the Top K documents using some secret sauce, then i would suggest you just set rows=K and let Solr do its thing, then post-process the results -- either in your client, or in a SearchComponent that modifies the SolrDocumentList produced by QueryComponent. : can you elaborate on what exactly your some logic involves? ... : https://people.apache.org/~hossman/#xyproblem : XY Problem : : Your question appears to be an XY Problem ... that is: you are dealing : with X, you are assuming Y will help you, and you are asking about Y : without giving more details about the X so that we can understand the : full issue. Perhaps the best solution doesn't involve Y at all? : See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
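The fetch-rows=K-then-post-process route reduces, on the client side, to a plain re-sort of the top K results by a blended score. A minimal sketch; the record, the field names, and the 50/50 weighting are entirely made up for illustration and come from no Solr or Nutch API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class Rerank {
    // Hypothetical per-document scores: Lucene's and an external one.
    record Doc(String id, double luceneScore, double externalScore) {}

    // Re-order the top K docs by a blended score, highest first.
    static List<Doc> rerank(List<Doc> topK, double wLucene, double wExternal) {
        List<Doc> out = new ArrayList<>(topK);
        out.sort(Comparator.comparingDouble(
                (Doc d) -> wLucene * d.luceneScore() + wExternal * d.externalScore())
            .reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("a", 0.9, 0.1),
            new Doc("b", 0.5, 0.9));
        // "b" wins on the 50/50 blend (0.7 vs 0.5)
        System.out.println(rerank(docs, 0.5, 0.5).get(0).id());
    }
}
```

Inside a SearchComponent the same sort would be applied to the SolrDocumentList instead of a local record list.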
Re: Weird query issues
Thanks Shawn, We are using the 3.6.2 client and server. I cleared my browser cache several times while querying (is that similar to clearing the cache in solrconfig.xml ?). The query is logged in the solrj-based client's application container, however I see it empty in the solr application container... so somehow it is getting swallowed by solr... I am not able to figure out how and why? Thanks Ravi Kiran Bhaskar On Fri, Apr 26, 2013 at 4:33 PM, Shawn Heisey s...@elyograg.org wrote: On 4/26/2013 1:01 PM, Ravi Solr wrote: Hello Shawn, We found that it is unrelated to the group queries instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SOLRJ ? I can generate similar query via curl hitting the select handler like - http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerTh read-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things 1. even if I query from admin console, irrespective of the query, I get same results as if its a cached page of *:* query. i.e. I cannot see the query I entered in the server log, the query doesn't even come to the server but I get same results as *:* 2. If I query via solrj no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Querying from the admin UI and not seeing anything in the server log sounds like browser caching. You can turn that off in solrconfig.xml. I could not duplicate what you're seeing with SolrJ. You didn't say what version of SolrJ, so I did this using 3.6.2 (same as your server version). I thought maybe if you had a query object that didn't have an actual query set, it might do what you're seeing, but that doesn't appear to be the case. 
I don't have a 3.6.2 server to test against, so I used my 3.5.0 and 4.2.1 servers. Test code: http://pastie.org/private/bnvurz1f9b9viawgqbxvmq Solr 4.2.1 log: INFO - 2013-04-26 14:17:24.127; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} hits=0 status=0 QTime=20 3.5.0 server log: Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Apr 26, 2013 2:20:23 PM org.apache.solr.core.SolrCore execute INFO: [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} status=500 QTime=0 Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Same code without the setParser line: Solr 4.2.1 log: INFO - 2013-04-26 14:14:01.270; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=javabin&version=2} hits=0 status=0 QTime=187 Thanks, Shawn
Re: Solr index searcher to lucene index searcher
Hi, Thanks Chris. For every document that matches the query I want to be able to compute the following set of features for a query-document pair: LuceneScore (the vector space score that lucene gives to each doc) LinkScore (computed from nutch) OpicScore (computed from nutch) co-rd in title,content,anchor,url wt of Entity in title,content,anchor,url length of title,content,anchor,url sum-of-tf in title,content,anchor,url sum-of-norm-tf in title,content,anchor,url min-of-tf in title,content,anchor,url max-of-tf in title,content,anchor,url variance-of-tf in title,content,anchor,url sum-of-tf-idf in title,content,anchor,url site-reputation-score entity-support-score domain score url-click-count query-url-click-count num-of-slashes-in-url Based on these features I want to build a machine-learned model that will learn to rank/score the documents. I am trying to understand how to compute the features efficiently on the fly. Looking into the index and computing these features seems to be very slow, so for the time being I wanted to implement this by looking into the top K documents. A few of these features have to be computed on the fly, and some of them are computed while indexing and stored in the index. I need to be able to look at all features to score/rank the final set of documents. Thanks, Pom.. On Sat, Apr 27, 2013 at 5:43 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : used to call the lucene IndexSearcher . As the documents are collected in : TopDocs in Lucene , before that is passed back to Nutch , i used to look : into the top K matching documents , consult some external repository : and further score the Top K documents and reorder them in the TopDocs array : . These reordered TopDocs is passed to Nutch . All these reordering code : was implemented by Extending the lucene IndexSearcher class . 
1) that's basically the same info you provided before -- it still doesn't really tell us anything about what your current logic does with the top K documents and how/why/when you decide to reorder them or by how much -- details that are kind of important in being able to provide you with any meaningful advice on how to achieve your goal using the plugin hooks available in Solr. 2) if you only care about re-ordering the Top K documents using some secret sauce, then i would suggest you just set rows=K and let Solr do its thing, then post-process the results -- either in your client, or in a SearchComponent that modifies the SolrDocumentList produced by QueryComponent. : can you elaborate on what exactly your some logic involves? ... : https://people.apache.org/~hossman/#xyproblem : XY Problem : : Your question appears to be an XY Problem ... that is: you are dealing : with X, you are assuming Y will help you, and you are asking about Y : without giving more details about the X so that we can understand the : full issue. Perhaps the best solution doesn't involve Y at all? : See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
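Several of the listed features (length of url, num-of-slashes-in-url, and the like) are cheap string computations that can be done once at index time rather than on the fly. A toy extractor; the feature names are assumptions for illustration and are not taken from Nutch or Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlFeatures {
    // Compute a few of the cheap per-URL features mentioned above.
    static Map<String, Integer> features(String url) {
        Map<String, Integer> f = new LinkedHashMap<>();
        f.put("url-length", url.length());
        f.put("num-of-slashes-in-url",
              (int) url.chars().filter(c -> c == '/').count());
        f.put("num-of-dots-in-url",
              (int) url.chars().filter(c -> c == '.').count());
        return f;
    }

    public static void main(String[] args) {
        System.out.println(features("http://example.com/a/b.html"));
    }
}
```

Features like these would typically be stored as indexed fields at crawl time, leaving only the query-dependent ones (tf sums, click counts per query) to be computed at search time.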