Re: Using another way instead of DIH
Hi, it simply means the configuration file of your DIH. Cheers

On 26 April 2013 03:37, xiaoqi belivexia...@gmail.com wrote:
Thanks for the help. data-config.xml? I can not find this file; do you mean data-import.xml or solrconfig.xml?
-- View this message in context: http://lucene.472066.n3.nabble.com/Using-another-way-instead-of-DIH-tp4058937p4059067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr metrics in Codahale metrics and Graphite?
Alan, Shawn,

If backporting to 3.x is hard, no worries; we don't necessarily require the patch, as we are heading to 4.x eventually. It is just much easier within our organization to test on the existing Solr 3.4, as there are a few internal dependencies and custom code on top of Solr. Also, Solr upgrades on production systems usually start a month or so after the upgrade on development systems (they require lots of testing and verification). Nevertheless, it is a good effort to make #solr #graphite friendly, so keep it up! :)

Dmitry

On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote:
On 4/25/2013 6:30 AM, Dmitry Kan wrote: We are very much interested in 3.4. On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk wrote: This is on top of trunk at the moment, but would be backported to 4.4 if there was interest.

This will be bad news, I'm sorry: All remaining work on 3.x versions happens in the 3.6 branch. This branch is in maintenance mode. It will only get fixes for serious bugs with no workaround. Improvements and new features won't be considered at all. You're welcome to try backporting patches from newer issues. Due to the major differences in the 3x and 4x codebases, the best-case scenario is that you'll be facing a very manual task. Some changes can't be backported because they rely on other features only found in 4.x code.

Thanks, Shawn
Re: How do I set compression on stored fields in Solr 4.2.1
Why don't we add a parameter to allow non-programmers to change it? Compression=FAST|etc

On Thursday, April 25, 2013, Chris Hostetter wrote:

: Subject: How do set compression for compression on stored fields in SOLR 4.2.1
: https://issues.apache.org/jira/browse/LUCENE-4226
: It mentions that we can set compression mode:
: FAST, HIGH_COMPRESSION, FAST_UNCOMPRESSION.

The compression details are hardcoded into the various codecs. If you wanted to customize this, you'd need to write your own codec subclass...

https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/codecs/compressing/class-use/CompressionMode.html

See, for example, the implementations of Lucene41StoredFieldsFormat and Lucene42TermVectorsFormat...

public final class Lucene41StoredFieldsFormat extends CompressingStoredFieldsFormat {
  /** Sole constructor. */
  public Lucene41StoredFieldsFormat() {
    super("Lucene41StoredFields", CompressionMode.FAST, 1 << 14);
  }
}

public final class Lucene42TermVectorsFormat extends CompressingTermVectorsFormat {
  /** Sole constructor. */
  public Lucene42TermVectorsFormat() {
    super("Lucene41StoredFields", "", CompressionMode.FAST, 1 << 12);
  }
}

-Hoss

-- Bill Bell billnb...@gmail.com cell 720-256-8076
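A note on wiring this in (not from the thread; the class name below is hypothetical): Solr 4.x selects codecs through a CodecFactory configured in solrconfig.xml, so a custom format subclass along the lines Hoss shows would be exposed through your own factory:

```xml
<!-- solrconfig.xml: swap the default codec factory for a custom
     CodecFactory whose codec returns a StoredFieldsFormat built with
     CompressionMode.HIGH_COMPRESSION instead of FAST.
     com.example.HighCompressionCodecFactory is a hypothetical class
     you would have to write and put on Solr's classpath. -->
<codecFactory class="com.example.HighCompressionCodecFactory"/>
```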
Pros and Cons of Starting a Handler Lazily?
I will use SolrCloud and its main purpose will be rich document indexing. The Solr example includes this definition:

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" ...>

which starts the handler up lazily. So what are the pros and cons of removing it for my situation?
Lucene native facets
Since facets are now included in Lucene, why don't we add a pass-through from Solr? The current facet code can live on, but we could create a new param like facet.lucene=true? Seems like a great enhancement! -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: what is the maximum XML file size to import?
Thanks to all for your suggestions. -- View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-maximum-XML-file-size-to-import-tp4058263p4059113.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Indexing Rich Documents
I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can directly use the example jar of Solr. However, for a real-time environment, what should I watch out for? Also, how do you send such documents into Solr for indexing? I think post.jar does not handle those file types. I should mention that I don't store the documents in a database.
Document is missing mandatory uniqueKey field: id for Solr PDF indexing
I use Solr 4.2.1 and these are my fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<!-- Common metadata fields, named specifically to match up with SolrCell metadata when
     parsing rich documents such as Word, PDF. Some fields are multiValued only because
     Tika currently may return multiple values for them. Some metadata is parsed from the
     documents, but there are some which come from the client context:
       content_type: From the HTTP headers of incoming stream
       resourcename: From SolrCell request param resource.name -->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell. NOTE: This field is not indexed by
     default, since it is also copied to "text" using copyField below. This is to save
     space. Use this field for returning and highlighting document content. Use the
     "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<!-- catchall field, containing all other searchable text fields
     (implemented via copyField further on in this schema) -->
<!-- <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse for
     efficient leading wildcard queries. -->
<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or group
     results by manufacturer. copied from "manu" via copyField -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

I run that command:

java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf

However I get that error, any ideas?

Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
    at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id.

On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote:
I use Solr 4.2.1 and these are my fields: [...] I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas? [...] SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id [...]
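For readers hitting the same error: the post.jar help that Raymond mentions shows passing the missing uniqueKey as a literal.* request parameter. A hedged sketch of the invocation (the id value is arbitrary, and the -Dparams behavior should be checked against `java -jar post.jar -help` for your Solr version):

```
# Supply the mandatory uniqueKey via SolrCell's literal.id parameter;
# post.jar appends -Dparams values to the update URL as a query string.
java -Durl=http://localhost:8983/solr/update/extract \
     -Dparams=literal.id=523387 \
     -jar post.jar 523387.pdf
```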
Re: Using another way instead of DIH
below is my data-import.xml, any suggestions?

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://*:3306/guang" user="guang" password="guang"/>
  <document name="products">
    <entity name="item" pk="id"
            query="SELECT a.*,d.* FROM item a LEFT JOIN item_ctr_by_area d ON a.id=d.item_id LEFT JOIN shop b ON a.shop_id = b.id WHERE a.status =1 AND b.status = 1 AND b.uctrac_status =0 AND uctrac_adgroup_id IS NOT NULL">
      <field column="id" name="item_id"/>
      <field column="title" name="item_title"/>
      <field column="description" name="item_description"/>
      <field column="price" name="item_price"/>
      <field column="promotion" name="item_promotion"/>
      <field column="pic_url" name="item_picurl"/>
      <field column="local_pic_url" name="item_local_picurl"/>
      <field column="detail_url" name="item_detailurl"/>
      <field column="recommend_value" name="item_recommend_value"/>
      <field column="uctrac_adgroup_id" name="uctrac_adgroup_id"/>
      <field column="uctrac_price" name="uctrac_adgroup_price"/>
      <field column="uctrac_status" name="uctrac_adgroup_status"/>
      <field column="uctrac_creative_id" name="uctrac_creative_id"/>
      <field column="lctr" name="item_lctr"/>
      <field column="CTR_ALL" name="region_ctr_all"/>
      <field column="CTR_N" name="region_ctr_n"/>
      <field column="CTR_MN" name="region_ctr_mn"/>
      <field column="CTR_MS" name="region_ctr_ms"/>
      <field column="CTR_S" name="region_ctr_s"/>
      <field column="CTR_011100" name="region_ctr_0111"/>
      <field column="CTR_011300" name="region_ctr_0113"/>
      <field column="CTR_012100" name="region_ctr_0121"/>
      <field column="CTR_013100" name="region_ctr_0131"/>
      <field column="CTR_013200" name="region_ctr_0132"/>
      <field column="CTR_013300" name="region_ctr_0133"/>
      <field column="CTR_013400" name="region_ctr_0134"/>
      <field column="CTR_013500" name="region_ctr_0135"/>
      <field column="CTR_013700" name="region_ctr_0137"/>
      <field column="CTR_014100" name="region_ctr_0141"/>
      <field column="CTR_014200" name="region_ctr_0142"/>
      <field column="CTR_014300" name="region_ctr_0143"/>
      <field column="CTR_014400" name="region_ctr_0144"/>
      <field column="CTR_015100" name="region_ctr_0151"/>
      <field column="CTR_016100" name="region_ctr_0161"/>
      <field column="CTR_ALL_2" name="region_ctr_all_2"/>
      <field column="CTR_N_2" name="region_ctr_n_2"/>
      <field column="CTR_MN_2" name="region_ctr_mn_2"/>
      <field column="CTR_MS_2" name="region_ctr_ms_2"/>
      <field column="CTR_S_2" name="region_ctr_s_2"/>
      <field column="CTR_011100_2" name="region_ctr_0111_2"/>
      <field column="CTR_011300_2" name="region_ctr_0113_2"/>
      <field column="CTR_012100_2" name="region_ctr_0121_2"/>
      <field column="CTR_013100_2" name="region_ctr_0131_2"/>
      <field column="CTR_013200_2" name="region_ctr_0132_2"/>
      <field column="CTR_013300_2" name="region_ctr_0133_2"/>
      <field column="CTR_013400_2" name="region_ctr_0134_2"/>
      <field column="CTR_013500_2" name="region_ctr_0135_2"/>
      <field column="CTR_013700_2" name="region_ctr_0137_2"/>
      <field column="CTR_014100_2" name="region_ctr_0141_2"/>
      <field column="CTR_014200_2" name="region_ctr_0142_2"/>
      <field column="CTR_014300_2" name="region_ctr_0143_2"/>
      <field column="CTR_014400_2" name="region_ctr_0144_2"/>
      <field column="CTR_015100_2" name="region_ctr_0151_2"/>
      <field column="CTR_016100_2" name="region_ctr_0161_2"/>
      <field column="CTR_ALL_4" name="region_ctr_all_4"/>
      <field column="CTR_N_4" name="region_ctr_n_4"/>
      <field column="CTR_MN_4" name="region_ctr_mn_4"/>
      <field column="CTR_MS_4" name="region_ctr_ms_4"/>
      <field column="CTR_S_4" name="region_ctr_s_4"/>
      <field column="CTR_011100_4" name="region_ctr_0111_4"/>
      <field column="CTR_011300_4" name="region_ctr_0113_4"/>
      <field column="CTR_012100_4" name="region_ctr_0121_4"/>
      <field column="CTR_013100_4" name="region_ctr_0131_4"/>
      <field column="CTR_013200_4" name="region_ctr_0132_4"/>
      <field column="CTR_013300_4" name="region_ctr_0133_4"/>
      <field column="CTR_013400_4" name="region_ctr_0134_4"/>
      <field column="CTR_013500_4" name="region_ctr_0135_4"/>
      <field column="CTR_013700_4" name="region_ctr_0137_4"/>
      <field column="CTR_014100_4" name="region_ctr_0141_4"/>
      <field column="CTR_014200_4" name="region_ctr_0142_4"/>
      <field column="CTR_014300_4" name="region_ctr_0143_4"/>
      <field column="CTR_014400_4" name="region_ctr_0144_4"/>
      <field column="CTR_015100_4" name="region_ctr_0151_4"/>
      <field column="CTR_016100_4" name="region_ctr_0161_4"/>
      <field column="votescore"
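Since the thread is about replacing DIH: one common alternative is to run the SQL yourself and post the resulting documents to Solr's update endpoint. A minimal Python sketch of the mapping step, assuming rows arrive as dicts from whatever DB driver you use (the mapping covers only a few of the columns from the config above, and actually posting the docs to Solr is left out):

```python
# Map DB rows to Solr documents using the same column -> field renaming
# that the DIH config expresses with <field column="..." name="..."/>.
COLUMN_TO_FIELD = {
    "id": "item_id",
    "title": "item_title",
    "description": "item_description",
    "price": "item_price",
    "lctr": "item_lctr",
}

def row_to_solr_doc(row):
    """Rename columns per the mapping; unmapped columns are dropped."""
    return {field: row[col] for col, field in COLUMN_TO_FIELD.items() if col in row}

row = {"id": 42, "title": "example item", "price": 9.99, "extra": "ignored"}
doc = row_to_solr_doc(row)
# doc == {"item_id": 42, "item_title": "example item", "item_price": 9.99}
```

The resulting dicts can then be serialized with json.dumps and sent to /solr/update in batches, which gives you full control over scheduling and error handling that DIH hides.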
Re: [solr 3.4] anomaly during distributed facet query with 102 shards
Hi,

1. Ruled out the possibility of testing a 4.2.1 router against the 3.4 shard farm, for obvious reasons (java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format).
2. Tried jetty, but same result.

On Thu, Apr 25, 2013 at 5:16 PM, Dmitry Kan solrexp...@gmail.com wrote: Thanks, Yonik. Yes, I supposed that. We are in the pre-release phase, so we have the pressure. Solr 3.4. Would setting up a 4.2.1 router work with 3.4 shards? On 25 Apr 2013 17:11, Yonik Seeley yo...@lucidworks.com wrote: On Thu, Apr 25, 2013 at 8:32 AM, Dmitry Kan solrexp...@gmail.com wrote: Are there any distrib facet gurus on the list? I would be ready to try sensible ideas, including on the source code level, if someone of you could give me a hand. The Lucene/Solr Revolution conference is coming up next week, so I think many are busy creating their presentations. What version of Solr are you using? Have you tried using a newer version? Is it reproducible with a smaller cluster? If so, you could try using the included Jetty server instead of Tomcat to rule out that factor. -Yonik http://lucidworks.com
Log Monitoring System for SolrCloud, and Logging to log4j in SolrCloud?
I want to use GrayLog2 to monitor my logging files for SolrCloud. However, I think that GrayLog2 works with log4j and logback, while Solr uses slf4j. How can I solve this problem, and what log monitoring system do folks use?
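For context (not from the thread): slf4j is only a logging facade, so Solr's log output can be routed to log4j by putting the slf4j-log4j12 binding jar (plus log4j itself) on the webapp classpath in place of the default binding. At that point a standard log4j.properties controls the output; a minimal sketch (appender name and file path are hypothetical) that writes a rolling file a log shipper could pick up:

```
# log4j.properties -- route everything to a rolling file
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c: %m%n
```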
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Hi Raymond;

Now I get that error:

SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException:

2013/4/26 Raymond Wiker rwi...@gmail.com:
You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id. [...]
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
http://wiki.apache.org/solr/post.jar

-- Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 26 Apr 2013, at 13:28, Furkan KAMACI furkankam...@gmail.com wrote:
Hi Raymond; Now I get that error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: [...]
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
If you can help me it would be nice. I get this error:

SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/extract..
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file 523387.pdf (application/pdf)
SimplePostTool: WARNING: Solr returned an error #404 Not Found
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf&literal.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update/extract..
Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket'
Time spent: 0:00:00.194

and there is nothing indexed. Here is my server log:

Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit
INFO: SolrDeletionPolicy.onCommit: commits:num=2
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c]
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d]
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 13[segments_d]
Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init
INFO: Opening Searcher@37342445 main
Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener sending requests to Searcher@37342445 main{StandardDirectoryReader(segments_2:1:nrt)}
Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher
INFO: QuerySenderListener done.
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [collection1] Registered new searcher Searcher@37342445 main{StandardDirectoryReader(segments_2:1:nrt)}
Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156

2013/4/26 Jan Høydahl jan@cominvent.com:
http://wiki.apache.org/solr/post.jar [...]
Re: Solr Indexing Rich Documents
It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I take care of? Also, how do you send such documents into Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
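For reference, the request that ends up hitting the extracting handler can be sketched in a few lines of Python; the helper name, base URL, and example id are illustrative, not part of Solr or post.jar:

```python
from urllib.parse import urlencode

def build_extract_url(base_url, file_path, doc_id):
    """Build a SolrCell (/update/extract) request URL for one rich document.

    literal.id supplies the required uniqueKey value; resource.name lets
    Tika use the file name when guessing the content type.
    """
    params = urlencode({
        "literal.id": doc_id,
        "resource.name": file_path,
        "commit": "true",
    })
    return "%s/update/extract?%s" % (base_url.rstrip("/"), params)

url = build_extract_url("http://localhost:8983/solr", "/docs/523387.pdf", "523387")
print(url)
```

The same parameters appear in the FileNotFoundException URLs quoted elsewhere in this digest, which is what makes the doubled /extract path there easy to spot.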
Re: Lucene native facets
Sure, but they are completely different conceptual models of faceting - Solr is dynamic, based on the actual data for the hierarchy, while Lucene is static, based on a predefined taxonomy that must be meticulously created before any data is added. Solr answers the question "What structure does your data have?", while Lucene answers the question "How does your data fit into a predefined structure?" Both are valid and valuable questions, but they are still rather distinct. Yes, Solr should provide support for static facet taxonomies, but what exactly that would look like... has not even been proposed yet, let alone as simple as facet.lucene=true. OTOH, maybe most of the work may be simply to add taxonomy management to Solr (as a passthrough to the Lucene features), and then maybe a lot of the existing Solr facet parameters simply need parallel Lucene-oriented implementations. But, the other half of Solr facets is how filter queries are used for selecting facets. That's all done at the application level, so it can't be hidden from the app so easily. Maybe a new Solr facet filter API can be developed that can then in turn have Solr-facet vs. Lucene-facet implementations. Or, maybe a new dynamic-facet Lucene API could be added as well, so that Solr facets in fact become a passthrough as well. Still, it would be good to support Lucene facets in Solr. Maybe that could be one of the key turning points for what defines Lucene/Solr 5.0. Is there a Jira for this? I don't recall one. -- Jack Krupansky -Original Message- From: William Bell Sent: Friday, April 26, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Lucene native facets Since facets are now included in Lucene, why don't we add a pass through from Solr? The current facet code can live on but we could create a new param like facet.lucene=true? Seems like a great enhancement! -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Prons an Cons of Startup Lazy a Handler?
Lazy startup simply means that you are willing to tolerate a slight delay on the first request to that request handler. It also has the side effect that if there are any problems with starting up the handler, they won't be seen until that first request. In short, whether you want to keep the handler is completely independent of the lazy startup option. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:01 AM To: solr-user@lucene.apache.org Subject: Prons an Cons of Startup Lazy a Handler? I will use SolrCloud and its main purpose will be rich document indexing. The Solr example includes this definition: <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler"> It starts the handler up lazily. So what are the pros and cons of removing that option for my situation?
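For reference, the handler definition under discussion, roughly as it appears in the example solrconfig.xml (the defaults shown here are illustrative, not required for lazy startup):

```xml
<!-- startup="lazy" defers initialization until the first /update/extract
     request; drop the attribute to pay the cost (and surface any startup
     errors) at core load time instead. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
```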
Re: Solr Indexing Rich Documents
Thanks for the answer. I get an error now: a FileNotFoundException, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I take care of? Also, how do you send such documents into Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
I think that I should start a new thread for my question to help people who searches for same situation. 2013/4/26 Furkan KAMACI furkankam...@gmail.com If you can help me it would be nice. I get that error: SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/update/extract.. Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log POSTing file 523387.pdf (application/pdf) SimplePostTool: WARNING: Solr returned an error #404 Not Found SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdfliteral.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf 1 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/update/extract .. Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket' Time spent: 0:00:00.194 and there is nothing indexed. 
Here is my server log: Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c] commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 13[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init INFO: Opening Searcher@37342445 main Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener sending requests to Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener done. 
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher INFO: [collection1] Registered new searcher Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156 2013/4/26 Jan Høydahl jan@cominvent.com http://wiki.apache.org/solr/post.jar -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 26. apr. 2013 kl. 13:28 skrev Furkan KAMACI furkankam...@gmail.com: Hi Raymond; Now I get that error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: 2013/4/26 Raymond Wiker rwi...@gmail.com You could start by doing java post.jar -help --- the 7th example shows exactly what you need to do to add a document id. On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote: I use Solr 4.2.1 and these are my fields: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=text type=text_general indexed=true stored=true/ !-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich documents such as Word, PDF. Some fields are multiValued only because Tika currently may return multiple values for them. 
Some metadata is parsed from the documents, but there are some which come from the client context: content_type: From the HTTP headers of incoming stream resourcename: From SolrCell request param resource.name -- field name=title type=text_general indexed=true stored=true multiValued=true/ field name=subject type=text_general indexed=true stored=true/ field name=description type=text_general indexed=true stored=true/ field name=comments type=text_general indexed=true stored=true/ field name=author type=text_general indexed=true stored=true/ field name=keywords type=text_general indexed=true stored=true/ field name=category type=text_general indexed=true stored=true/ field
SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
Could anybody help me for my error. When I try to post documents with post.jar I get that error: SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/update/extract.. Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log POSTing file 523387.pdf (application/pdf) SimplePostTool: WARNING: Solr returned an error #404 Not Found *SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException:* http://localhost:8983/solr/update/extract/extract?resource.name=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdfliteral.id=%2Fhome%2Fll%2FDesktop%2Fb%2Flucene-solr-lucene_solr_4_2_1%2Fsolr%2Fexample%2Fexampledocs%2F523387.pdf 1 files indexed. COMMITting Solr index changes to http://localhost:8983/solr/update/extract.. Disconnected from the target VM, address: '127.0.0.1:58385', transport: 'socket' Time spent: 0:00:00.194 and there is nothing indexed. 
Here is my server log: Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_c,generation=12,filenames=[segments_c] commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@386b8592; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_d,generation=13,filenames=[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: newest commit = 13[segments_d] Apr 26, 2013 2:55:58 PM org.apache.solr.search.SolrIndexSearcher init INFO: Opening Searcher@37342445 main Apr 26, 2013 2:55:58 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener sending requests to Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener done. 
Apr 26, 2013 2:55:58 PM org.apache.solr.core.SolrCore registerSearcher INFO: [collection1] Registered new searcher Searcher@37342445main{StandardDirectoryReader(segments_2:1:nrt)} Apr 26, 2013 2:55:58 PM org.apache.solr.update.processor.LogUpdateProcessor finish INFO: [collection1] webapp=/solr path=/update/extract params={commit=true} {commit=} 0 156 I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Maybe you are confusing things by mixing instructions - there are SEPARATE instructions for directly using SolrCell and implicitly using it via post.jar. Pick which you want and stick with it. DO NOT MIX the instructions. You wrote: I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf Was there a GOOD reason that you chose that URL? Best to stay with what the post.jar wiki recommends: Post all CSV, XML, JSON and PDF documents using AUTO mode which detects type based on file name: java -Dauto -jar post.jar *.csv *.xml *.json *.pdf Or, stick with SolrCell directly, but follow its distinct instructions: http://wiki.apache.org/solr/ExtractingRequestHandler Again, DO NOT MIX the instructions from the two. post.jar is designed so that you do not need to know or care exactly how rich document indexing works. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<!-- Common metadata fields, named specifically to match up with SolrCell
     metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them. Some metadata is parsed from the documents,
     but there are some which come from the client context:
       content_type: From the HTTP headers of incoming stream
       resourcename: From SolrCell request param resource.name -->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
     NOTE: This field is not indexed by default, since it is also copied to
     "text" using copyField below. This is to save space. Use this field for
     returning and highlighting document content. Use the "text" field to
     search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
<!-- catchall field, containing all other searchable text fields
     (implemented via copyField further on in this schema) -->
<!-- <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/> -->
<!-- catchall text field that indexes tokens both normally and in reverse
     for efficient leading wildcard queries. -->
<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
<!-- non-tokenized version of manufacturer to make it easier to sort or
     group results by manufacturer. copied from "manu" via copyField -->
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>

I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas?

Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
    at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at
Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing
Jack, thanks for your answers. Ok, when I remove -Durl parameter I think it works, thanks. However I think that I have a problem with my schema. I get that error: Apr 26, 2013 3:52:21 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=/home/ll/Desktop/b/lucene-solr-lucene_solr_4_2_1/solr/example/exampledocs/523387.pdf] multiple values encountered for non multiValued copy field text: application/pdf 2013/4/26 Jack Krupansky j...@basetechnology.com Maybe you are confusing things by mixing instructions - there are SEPARATE instructions for directly using SolrCell and implicitly using it via post.jar. Pick which you want and stick with it. DO NOT MIX the instructions. You wrote: I run that command: java -Durl= http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf Was there a GOOD reason that you chose that URL? Best to stay with what the post.jar wiki recommends: Post all CSV, XML, JSON and PDF documents using AUTO mode which detects type based on file name: java -Dauto -jar post.jar *.csv *.xml *.json *.pdf Or, stick with SolrCell directly, but follow its distinct instructions: http://wiki.apache.org/solr/ExtractingRequestHandler Again, DO NOT MIX the instructions from the two. post.jar is designed so that you do not need to know or care exactly how rich document indexing works. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields: field name=id type=string indexed=true stored=true required=true multiValued=false / field name=text type=text_general indexed=true stored=true/ !-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich documents such as Word, PDF. Some fields are multiValued only because Tika currently may return multiple values for them. 
Some metadata is parsed from the documents, but there are some which come from the client context: content_type: From the HTTP headers of incoming stream resourcename: From SolrCell request param resource.name -- field name=title type=text_general indexed=true stored=true multiValued=true/ field name=subject type=text_general indexed=true stored=true/ field name=description type=text_general indexed=true stored=true/ field name=comments type=text_general indexed=true stored=true/ field name=author type=text_general indexed=true stored=true/ field name=keywords type=text_general indexed=true stored=true/ field name=category type=text_general indexed=true stored=true/ field name=resourcename type=text_general indexed=true stored=true/ field name=url type=text_general indexed=true stored=true/ field name=content_type type=string indexed=true stored=true multiValued=true/ field name=last_modified type=date indexed=true stored=true/ field name=links type=string indexed=true stored=true multiValued=true/ !-- Main body of document extracted by SolrCell. NOTE: This field is not indexed by default, since it is also copied to text using copyField below. This is to save space. Use this field for returning and highlighting document content. Use the text field to search the content. -- field name=content type=text_general indexed=false stored=true multiValued=true/ !-- catchall field, containing all other searchable text fields (implemented via copyField further on in this schema -- !-- field name=text type=text_general indexed=true stored=false multiValued=true/ -- !-- catchall text field that indexes tokens both normally and in reverse for efficient leading wildcard queries. -- field name=text_rev type=text_general_rev indexed=true stored=false multiValued=true/ !-- non-tokenized version of manufacturer to make it easier to sort or group results by manufacturer. 
copied from manu via copyField -- field name=manu_exact type=string indexed=true stored=false/ field name=payloads type=payloads indexed=true stored=true/ field name=_version_ type=long indexed=true stored=true/ I run that command: java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf However I get that error, any ideas? Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id at org.apache.solr.update.AddUpdateCommand.getIndexedId(AddUpdateCommand.java:88) at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:464) at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:346) at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100) at
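For readers who hit the same "multiple values encountered for non multiValued copy field text" error: it usually means several source fields are being copied into a single-valued destination field. A hedged schema.xml sketch of one fix (field names follow the example schema; which fields you copy is your choice):

```xml
<!-- Several SolrCell fields (content, title, content_type, ...) are copied
     into the catch-all "text" field, so it must accept multiple values. -->
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="content" dest="text"/>
<copyField source="title" dest="text"/>
```

Alternatively, remove the extra copyField directives so that only one value ever reaches a single-valued field.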
uniqueKey required false for multivalued id when indexing rich documents
I am new to Solr and trying to index rich files. I have defined this in my schema: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/> and there is a line in my schema: <uniqueKey>id</uniqueKey> Should I make it like this: <uniqueKey required="false">id</uniqueKey> for my purpose?
How to define a generic field to hold all undefined fields
I sent some documents to my Solr to be indexed. However, I get this kind of error: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However, there may be many fields like that. How can I define a generic field that holds all non-defined values, or maybe how can I ignore them?
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.comwrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
On 26 April 2013 18:15, Furkan KAMACI furkankam...@gmail.com wrote: Could anybody help me for my error. When I try to post documents with post.jar I get that error: [...] I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf The URL should be http://localhost:8983/solr/update . You have an extra /extract . Actually, if you are running from embedded Jetty, you should be able to skip the -Durl argument. Regards, Gora
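The doubled path in the 404 URL ( .../update/extract/extract?... ) is consistent with post.jar's auto mode appending the extracting handler's path itself for rich documents. A minimal Python sketch of that behavior (the helper is illustrative, not post.jar's actual code):

```python
def resolve_auto_url(base_url, file_name):
    """Mimic SimplePostTool's auto mode: for rich documents it targets the
    extracting handler by appending /extract to the update URL it was given.
    (Simplified sketch; the real tool handles many more file types.)"""
    rich = (".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx")
    if file_name.lower().endswith(rich):
        return base_url.rstrip("/") + "/extract"
    return base_url

# Passing the extracting handler's URL directly doubles the path -> 404:
print(resolve_auto_url("http://localhost:8983/solr/update/extract", "523387.pdf"))
# Passing the plain update URL (or omitting -Durl) works:
print(resolve_auto_url("http://localhost:8983/solr/update", "523387.pdf"))
```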
Re: uniqueKey required false for multivalued id when indexing rich documents
On 26 April 2013 18:38, Furkan KAMACI furkankam...@gmail.com wrote: I am new to Solr and try to index rich files. I have defined that at my schema: [...] uniqueKey required=false/uniqueKey This will not work: Please see http://wiki.apache.org/solr/UniqueKey for different use cases for the uniqueKey. For documents, I usually use the document name, or some segment of the filesystem path as the uniqueKey as that is automatically guaranteed to be unique. Regards, Gora
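Gora's suggestion (derive the uniqueKey value from the document's name or a segment of its filesystem path) can be sketched as follows; the helper and its flag are illustrative, not part of Solr or post.jar:

```python
import os.path
from urllib.parse import urlencode

def literal_id_param(path, use_basename=False):
    """Build the literal.id parameter for a rich document from its path.

    The full path is unique per machine; the base name is shorter but only
    safe if file names never collide across directories.
    """
    doc_id = os.path.basename(path) if use_basename else os.path.normpath(path)
    return urlencode({"literal.id": doc_id})

print(literal_id_param("/solr/example/exampledocs/523387.pdf"))
print(literal_id_param("/solr/example/exampledocs/523387.pdf", use_basename=True))
```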
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
Ok, solved 2013/4/26 Raymond Wiker rwi...@gmail.com On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.com wrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException
I did not specify a URL and that solved it, as you mention, because the default URL does not include /extract. 2013/4/26 Furkan KAMACI furkankam...@gmail.com Ok, solved 2013/4/26 Raymond Wiker rwi...@gmail.com On Fri, Apr 26, 2013 at 2:45 PM, Furkan KAMACI furkankam...@gmail.com wrote: I use that command to post: java -Durl=http://localhost:8983/solr/update/extract -Dauto -jar post.jar 523387.pdf I think you need to have the collection name in the url... something like http://localhost:8983/solr/mycollection/update/extract .
Re: How to define a generic field to hold all undefined fields
A dynamic field with the name pattern *, a type of string, and stored=true, indexed=true and multiValued=true should be good enough for a generic field. Generally, only use this in test/experiment/development. It's not recommended as an approach for production apps. There is a commented-out * pattern in the example schema, but it ignores all incoming data rather than indexing and storing it. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 9:16 AM To: solr-user@lucene.apache.org Subject: How to define a generic field to hold all undefined fields I sent some documents to my Solr to be indexed. However, I get this kind of error: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However, there may be many fields like that. How can I define a generic field that holds all non-defined values, or maybe how can I ignore them?
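A hedged schema.xml sketch of the two options (patterns follow the example schema's conventions; use one or the other, since only a single * dynamic field can exist):

```xml
<!-- Option 1: catch-all for otherwise-undefined fields (test/dev only). -->
<dynamicField name="*" type="string" indexed="true" stored="true"
              multiValued="true"/>

<!-- Option 2: silently discard unknown fields instead, as in the example
     schema's commented-out pattern (assumes an "ignored" fieldtype with
     indexed="false" stored="false" is defined). -->
<dynamicField name="*" type="ignored" multiValued="true"/>
```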
Re: SOLR Install
Hi Peri, I think that document means you can deploy your own web app and Solr in one container like Tomcat, but with different context paths. If you want to bring Solr into your project, you just need to add some Maven dependencies like:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>${solr.version}</version>
</dependency>

This is exactly what I do. Then you need to prepare a 'solr.home' dir in your project, and add some configuration in web.xml for the filters and servlets Solr needs. I copied those configurations and some admin pages from solr.war. I hope this helps, best regards! On 2013-4-25, at 12:58 AM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I'm trying to use Solr as part of another Maven-based web application. I'm not sure how to wire the two war files. Any help please? I found this documentation in Solr but am unsure how to go about it:

<!-- If you are wiring Solr into a larger web application which controls the web context root, you will probably want to mount Solr under a path prefix (app.war with /app/solr mounted into it, for example). You will need to put this prefix in front of the SolrDispatchFilter url-pattern mapping too (/solr/*), and also on any paths for legacy Solr servlet mappings you may be using. For the Admin UI to work properly in a path-prefixed configuration, the admin folder containing the resources needs to be under the app context root named to match the path-prefix. For example: .war xxx js main.js -->
<!--
<init-param>
  <param-name>path-prefix</param-name>
  <param-value>/xxx</param-value>
</init-param>
-->

Thank you, Peri Subrahmanya On 4/24/13 12:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: solrservice.php and the text of that error both sound like parts of Typo3... they're definitely not part of Solr. You should ask on a list devoted to Typo3 to figure out what to do in this situation.
It likely won't involve reconfiguring Solr. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn¹t a Game On Wed, Apr 24, 2013 at 11:53 AM, vishal gupta vishalgup...@yahoo.co.in wrote: Hi i am using Solr 4.2.0 and extension 2.8.2 with Typo3. Whever I try to do indexing pages and news pages It gets only 3.29% indexed. I checked a developer log and found error in solrservice.php. And in solr admin it is giving Dups is not defined please add it. What should i do in this case? If possible please send me the settings of schema.xml and solrconfig.xml .i am new to typo3 and solr both. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indeing-Partially-working-tp40586 23.html Sent from the Solr - User mailing list archive at Nabble.com. *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended recipient, please delete without copying and kindly advise us by e-mail of the mistake in delivery. NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global Services to any order or other contract unless pursuant to explicit written agreement or government initiative expressly permitting the use of e-mail for such purpose.
Re: Log Monitor System for SolrCloud and Logging to log4j at SolrCloud?
Slf4j is meant to work with existing frameworks - you can set it up to work with log4j, and Solr will use log4j by default in the about to be released 4.3. http://wiki.apache.org/solr/SolrLogging - Mark On Apr 26, 2013, at 7:19 AM, Furkan KAMACI furkankam...@gmail.com wrote: I want to use GrayLog2 to monitor my logging files for SolrCloud. However I think that GrayLog2 works with log4j and logback. Solr uses slf4j. How can I solve this problem and what logging monitoring system does folks use?
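As a minimal sketch (file paths and sizes are illustrative), a log4j.properties that writes Solr's slf4j/log4j output to a rolling file which a GrayLog2 or log-shipping agent can then pick up:

```properties
# log4j.properties: route Solr's slf4j output (bound to log4j) to a file
# that a log shipper / GrayLog2 agent can tail.
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c{2} - %m%n
```

A dedicated GELF appender for log4j also exists if you want to ship directly to GrayLog2, but that is a third-party add-on, not part of Solr.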
AutoSuggest+Grouping in one request
Hi everyone, Search dropdowns on popular sites like Amazon (example imagehttp://i.imgur.com/aQyM8WD.jpg) use autosuggested words along with grouping (Field Collapsing in Solr). While I can replicate the same functionality in Solr using two requests (first to obtain suggestions, second for the actual query using the most probable suggestion), I want to know if this can be done in one request itself. I understand that there are various ways to obtain suggestions (term component, facets, Solr's inbuilt Suggesterhttp://wiki.apache.org/solr/Suggester), and I'm open to using any one of them, if it means I'll be able to get everything (groups + suggestions) in one request. Looking forward to some advice with regard to this. Thanks, Rounak
RE: Using another way instead of DIH
yes, I misspoke. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: xiaoqi [mailto:belivexia...@gmail.com] Sent: Thursday, April 25, 2013 8:37 PM To: solr-user@lucene.apache.org Subject: RE: Using another way instead of DIH Thanks for help . data-config.xml ? i can not find this file , u mean data-import.xml or solrconfig.xml ? -- View this message in context: http://lucene.472066.n3.nabble.com/Using-another-way-instead-of-DIH-tp4058937p4059067.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR Install
If you unpack the solr.war file, you'll find some configuration in web.xml like:

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
<servlet>
  <servlet-name>Zookeeper</servlet-name>
  <servlet-class>org.apache.solr.servlet.ZookeeperInfoServlet</servlet-class>
</servlet>
<servlet>
  <servlet-name>LoadAdminUI</servlet-name>
  <servlet-class>org.apache.solr.servlet.LoadAdminUiServlet</servlet-class>
</servlet>

and so on. These configurations tell your application how to dispatch requests to Solr. Note that the SolrRequestFilter in solr.war's web.xml is mapped to the URL pattern /*. If you want to make a sub-context for Solr, it should be something like /solr/*, and then you need to put the web resources (admin.html, css, img, js, tpl from solr.war) in the *same* directory of your web app's WebRoot folder. For example, if you map SolrRequestFilter to the URL pattern /solr/*, your WebRoot dir looks like:

WebRoot
|--- solr
     |--- admin.html
     |--- css
     |--- ......

This is what the comment in solr.war's web.xml says, and I think it is also what confused you in the original email thread. In my web.xml, I just copied the whole Solr content, pasted it into mine, and edited some URL mappings. On 2013-4-26, at 10:04 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: Jundan, I got all the setup done correctly, i.e. I got the Maven dependencies, used a Maven overlay to copy all the Solr files to the WEB-INF directory, and also specified solr.home. The issue is that when I try to access any of the Solr URLs like /admin.html or /dataimport, nothing seems to happen. So I'm not sure how to correctly configure the web.xml; would it be possible to share your web.xml please?
Thank you, Peri Subrahmanya HTC Global Services (Development Manager for System Integration on Kuali OLE project @ Indiana University, Bloomington, USA) Cell: (+1) 618.407.3521 Skype/Gtalk: peri.subrahmanya On 4/26/13 9:51 AM, jnduan jnd...@gmail.com wrote: Hi Peri, I think that document means you can deploy your own web app and Solr in one container like Tomcat, but with different context paths. If you want to bring Solr into your project, you just need to add some Maven dependencies like:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-core</artifactId>
  <version>4.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-test-framework</artifactId>
  <version>${solr.version}</version>
</dependency>

This is exactly what I do. Then you need to prepare a 'solr.home' dir in your project, and write some configuration in web.xml for the filters and servlets Solr needs. I copied those configurations and some admin pages from solr.war. I hope this helps, best regards! On 2013-4-25, at 12:58 AM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I'm trying to use Solr as part of another Maven-based web application. I'm not sure how to wire the two war files. Any help please? I found this documentation in Solr but am unsure how to go about it: <!-- If you are wiring Solr into a larger web application which controls the web context root, you will probably want to mount Solr under a path prefix (app.war with /app/solr mounted into it, for example).
You will need to put this prefix in front of the SolrDispatchFilter url-pattern mapping too (/solr/*), and also on any paths for legacy Solr servlet mappings you may be using. For the Admin UI to work properly in a path-prefixed configuration, the admin folder containing the resources needs to be under the app context root named to match the path-prefix. For example:

.war
   xxx
      js
         main.js
-->
<!--
<init-param>
  <param-name>path-prefix</param-name>
  <param-value>/xxx</param-value>
</init-param>
-->

Thank you, Peri Subrahmanya On 4/24/13 12:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: solrservice.php and the text of that error both sound like parts of Typo3... they're definitely not part of Solr. You should ask on a list devoted to Typo3 to figure out what to do in this situation. It
RE: Using another way instead of DIH
Here are some things I would try: 1. Make sure the parent entity is only returning 1 row per Solr document. If not, move the problem joins to their own queries in child entities. 2. For the child entities, use caching. This prevents the n+1 select problem. The changes are: remove the pk attribute (only the parent entity needs this, and only to support delta updates); remove the where clause from the query; add cacheKey/cacheLookup to each child like this: cacheKey='id' cacheLookup='item.shop_id'; add cacheImpl='SortedMapBackedCache' to each child. This will cache in-memory. 3. If caching uses too much memory, see https://issues.apache.org/jira/browse/SOLR-2613 and https://issues.apache.org/jira/browse/SOLR-2948. These are disk-backed cache implementations that you can use as alternatives to SortedMapBackedCache. Or you can write your own. 4. If it is still too slow, you can parallelize it by splitting the data into partitions and running multiple DIH handlers at once. This is a somewhat complex solution but still might be easier than writing a multi-threaded import program yourself. One way to partition SQL data like this is to add a where clause like: where mod(id, 4)=${dataimporter.request.partitionNumber} I will mention that I recently converted one of our applications to use its own SolrJ-based code to update instead of DIH. We were using BerkleyBackedCache from SOLR-2613 to handle the child entities, and it worked well. But the app dev team wanted something that was part of their codebase that they could maintain more easily, so we migrated off of DIH. We do updates more frequently and batch the updates so everything fits in-memory. Doing it this way, the SolrJ code was very straightforward and quick to write.
James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: xiaoqi [mailto:belivexia...@gmail.com] Sent: Friday, April 26, 2013 5:10 AM To: solr-user@lucene.apache.org Subject: Re: Using another way instead of DIH below is my data-import.xml, any suggestion?

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://*:3306/guang" user="guang" password="guang"/>
  <document name="products">
    <entity name="item" pk="id"
            query="SELECT a.*,d.* FROM item a LEFT JOIN item_ctr_by_area d ON a.id=d.item_id LEFT JOIN shop b ON a.shop_id = b.id WHERE a.status =1 AND b.status = 1 AND b.uctrac_status =0 AND uctrac_adgroup_id IS NOT NULL">
      <field column="id" name="item_id"/>
      <field column="title" name="item_title"/>
      <field column="description" name="item_description"/>
      <field column="price" name="item_price"/>
      <field column="promotion" name="item_promotion"/>
      <field column="pic_url" name="item_picurl"/>
      <field column="local_pic_url" name="item_local_picurl"/>
      <field column="detail_url" name="item_detailurl"/>
      <field column="recommend_value" name="item_recommend_value"/>
      <field column="uctrac_adgroup_id" name="uctrac_adgroup_id"/>
      <field column="uctrac_price" name="uctrac_adgroup_price"/>
      <field column="uctrac_status" name="uctrac_adgroup_status"/>
      <field column="uctrac_creative_id" name="uctrac_creative_id"/>
      <field column="lctr" name="item_lctr"/>
      <field column="CTR_ALL" name="region_ctr_all"/>
      <field column="CTR_N" name="region_ctr_n"/>
      <field column="CTR_MN" name="region_ctr_mn"/>
      <field column="CTR_MS" name="region_ctr_ms"/>
      <field column="CTR_S" name="region_ctr_s"/>
      <field column="CTR_011100" name="region_ctr_0111"/>
      <field column="CTR_011300" name="region_ctr_0113"/>
      <field column="CTR_012100" name="region_ctr_0121"/>
      <field column="CTR_013100" name="region_ctr_0131"/>
      <field column="CTR_013200" name="region_ctr_0132"/>
      <field column="CTR_013300" name="region_ctr_0133"/>
      <field column="CTR_013400" name="region_ctr_0134"/>
      <field column="CTR_013500" name="region_ctr_0135"/>
      <field column="CTR_013700" name="region_ctr_0137"/>
      <field column="CTR_014100" name="region_ctr_0141"/>
      <field column="CTR_014200" name="region_ctr_0142"/>
      <field column="CTR_014300" name="region_ctr_0143"/>
      <field column="CTR_014400" name="region_ctr_0144"/>
      <field column="CTR_015100" name="region_ctr_0151"/>
      <field column="CTR_016100" name="region_ctr_0161"/>
      <field column="CTR_ALL_2" name="region_ctr_all_2"/>
      <field column="CTR_N_2" name="region_ctr_n_2"/>
      <field column="CTR_MN_2" name="region_ctr_mn_2"/>
      <field column="CTR_MS_2" name="region_ctr_ms_2"/>
      <field column="CTR_S_2" name="region_ctr_s_2"/>
      <field column="CTR_011100_2" name="region_ctr_0111_2"/>
      <field column="CTR_011300_2" name="region_ctr_0113_2"/>
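A minimal sketch of James's cached-child-entity suggestion applied to a config like the one above (only a couple of fields shown; the cacheKey/cacheLookup column names are my reading of the table layout in the quoted config, not something James specified):

```xml
<entity name="item" pk="id" query="SELECT a.* FROM item a WHERE a.status = 1">
  <field column="id" name="item_id"/>
  <field column="title" name="item_title"/>
  <!-- cached child: no pk attribute, no WHERE clause; the rows are fetched
       once and joined in memory via cacheKey/cacheLookup, instead of running
       one query per parent row (the n+1 select problem) -->
  <entity name="ctr" query="SELECT * FROM item_ctr_by_area"
          cacheKey="item_id" cacheLookup="item.id"
          cacheImpl="SortedMapBackedCache">
    <field column="CTR_ALL" name="region_ctr_all"/>
  </entity>
</entity>
```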
SolrDocument getFieldNames() exclude dynamic fields?
Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
Re: Solr Indexing Rich Documents
Hi Furkan, post.jar is meant to be used as an example, quick start, etc. For production (incremental updates, deletes) consider using http://manifoldcf.apache.org for indexing rich documents. It utilises the ExtractingRequestHandler feature of Solr. --- On Fri, 4/26/13, Furkan KAMACI furkankam...@gmail.com wrote: From: Furkan KAMACI furkankam...@gmail.com Subject: Re: Solr Indexing Rich Documents To: solr-user@lucene.apache.org Date: Friday, April 26, 2013, 3:39 PM Thanks for the answer, I get an error now: a FileNotFound exception, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I watch out for? Also, how do you send such documents to Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Re: SolrDocument getFieldNames() exclude dynamic fields?
Apologies, I wasn't storing these dynamic fields. On Fri, Apr 26, 2013 at 11:01 AM, Luis Lebolo luis.leb...@gmail.com wrote: Hi All, I'm using SolrJ's QueryResponse to retrieve all SolrDocuments from a query. When I use SolrDocument's getFieldNames(), I get back a list of fields that excludes dynamic fields (even though I know they are not empty). Is there a way to get a list of all fields for a given SolrDocument? Thanks, Luis
excluding something from copyfield source?
Hi; I use this: <copyField source="*" dest="text"/> However, I want to exclude something, i.e. the author field. How can I do that?
Re: Solr Indexing Rich Documents
Is there any example at the wiki for ManifoldCF? 2013/4/26 Ahmet Arslan iori...@yahoo.com Hi Furkan, post.jar is meant to be used as an example, quick start, etc. For production (incremental updates, deletes) consider using http://manifoldcf.apache.org for indexing rich documents. It utilises the ExtractingRequestHandler feature of Solr. --- On Fri, 4/26/13, Furkan KAMACI furkankam...@gmail.com wrote: From: Furkan KAMACI furkankam...@gmail.com Subject: Re: Solr Indexing Rich Documents To: solr-user@lucene.apache.org Date: Friday, April 26, 2013, 3:39 PM Thanks for the answer, I get an error now: a FileNotFound exception, as I mentioned in the other thread. Now I'm trying to solve it. 2013/4/26 Jack Krupansky j...@basetechnology.com It's called SolrCell or the ExtractingRequestHandler (/update/extract), which the newer post.jar knows to use for some file types: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 4:48 AM To: solr-user@lucene.apache.org Subject: Solr Indexing Rich Documents I have a large corpus of rich documents, i.e. pdf and doc files. I think that I can use the example jar of Solr directly. However, for a real-time environment, what should I watch out for? Also, how do you send such documents to Solr to index? I think post.jar does not handle those file types. I should mention that I don't store documents in a database.
Exclude Pattern at Dynamic Field
I use this in my Solr 4.2.1: <dynamicField name="*" type="ignored" multiValued="true"/> However, can I exclude some patterns from it?
Re: Using another way instead of DIH
On 4/25/2013 9:00 AM, xiaoqi wrote: i using DIH to build index is slow , when it fetch 2 million rows , it will spend 20 minutes , very slow. If it takes 20 minutes for two million records, I'd say it's working very well. I do six simultaneous MySQL imports of 13 million records each. It takes a little over 3 hours on Solr 3.5.0, a little over four hours on Solr 4.2.1 (due to compression and the transaction log). If I do them one at a time instead of all at once, it will go *slightly* faster for each one, but the overall process would take a whole day. For comparison purposes, that's about 20 minutes each time it does 1 million rows. Yours is going twice as fast as mine. Looking at your config file, I don't see a batchSize parameter. This is a change that is specific to MySQL. You can greatly reduce the memory usage by including this attribute in the dataSource tag along with the user and password: batchSize=-1 With two million records and no batchSize parameter, I'm surprised you aren't hitting an Out Of Memory error. By default JDBC will pull down all the results and store them in memory, then DIH will begin indexing. A batchSize of -1 makes DIH tell the MySQL JDBC driver to stream the results instead of storing them. Reducing the memory usage in this way might make it go faster. Thanks, Shawn
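Concretely, Shawn's batchSize suggestion is a one-attribute change to the dataSource element from the quoted config (the host is a placeholder, as in the original):

```xml
<!-- batchSize="-1" makes the MySQL JDBC driver stream rows instead of
     buffering the entire result set in memory before DIH starts indexing -->
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://*:3306/guang"
            user="guang" password="guang"
            batchSize="-1"/>
```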
DataImportHandler - Indexing xml content
I have a column in my database that is of type long text and holds xml content. I was wondering, when I define the entity, is there a way to provide a custom extractor that takes in the xml and returns rows with appropriate fields to be indexed. Thank you, Peri Subrahmanya On 4/26/13 12:24 PM, Shawn Heisey s...@elyograg.org wrote: On 4/25/2013 9:00 AM, xiaoqi wrote: i using DIH to build index is slow , when it fetch 2 million rows , it will spend 20 minutes , very slow. If it takes 20 minutes for two million records, I'd say it's working very well. I do six simultaneous MySQL imports of 13 million records each. It takes a little over 3 hours on Solr 3.5.0, a little over four hours on Solr 4.2.1 (due to compression and the transaction log). If I do them one at a time instead of all at once, it will go *slightly* faster for each one, but the overall process would take a whole day. For comparison purposes, that's about 20 minutes each time it does 1 million rows. Yours is going twice as fast as mine. Looking at your config file, I don't see a batchSize parameter. This is a change that is specific to MySQL. You can greatly reduce the memory usage by including this attribute in the dataSource tag along with the user and password: batchSize=-1 With two million records and no batchSize parameter, I'm surprised you aren't hitting an Out Of Memory error. By default JDBC will pull down all the results and store them in memory, then DIH will begin indexing. A batchSize of -1 makes DIH tell the MySQL JDBC driver to stream the results instead of storing them. Reducing the memory usage in this way might make it go faster. Thanks, Shawn
Customizing Solr GUI
Hi, I want to customize the Solr GUI, and I learnt that the most popular options are 1. Velocity - which is integrated with Solr; the format and options can be customized 2. Project Blacklight Pros and cons? Secondly, I read that one can delete data just by running a delete query in the URL. Does either Velocity or Blacklight provide a way to disable this, or provide any kind of security or access control - so that users can only browse/search and admins can view the admin screen? How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: excluding something from copyfield source?
On 26 April 2013 20:51, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I use this: <copyField source="*" dest="text"/> However, I want to exclude something, i.e. the author field. How can I do that? Instead of using *, use separate copyField directives for the fields that you want copied. You can also use more restrictive globs, e.g., <copyField source="*_txt" dest="text"/> Regards, Gora
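A sketch of the explicit alternative Gora describes (the field names other than author are placeholders, not from the original schema):

```xml
<!-- enumerate the fields to copy instead of using a catch-all glob;
     author is excluded simply by leaving it out of the list -->
<copyField source="title" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="keywords" dest="text"/>
```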
Re: Customizing Solr GUI
Generally, your UI web pages should communicate with your own application layer, which in turn communicates with Solr, but you should try to avoid having Solr itself visible to the outside world. -- Jack Krupansky -Original Message- From: kneerosh Sent: Friday, April 26, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Customizing Solr GUI Hi, I want to customize Solr gui, and I learnt that the most popular options are 1. Velocity- which is integrated with Solr. The format and options can be customized 2. Project Blacklight Pros and cons? Secondly I read that one can delete data by just running a delete query in the URL. Does either velocity or blacklight provide a way to disable this, or provide any kind of security or access control- so that users can only browse/search and admins can view the admin screen. How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DataImportHandler - Indexing xml content
Have you looked at: http://wiki.apache.org/solr/DataImportHandler#FieldReaderDataSource ? Regards, Alex. On Fri, Apr 26, 2013 at 12:29 PM, Peri Subrahmanya peri.subrahma...@htcinc.com wrote: I have a column in my database that is of type long text and holds xml content. I was wondering when I define the entity record is there a way to provide a custom extractor that will take in the xml and return rows with appropriate fields to be indexed. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Customizing Solr GUI
So, building on this: 1) Velocity is an option for internal admin interface because it is collocated with Solr and therefore does not 'hide' it 2) Blacklight is the (Rails-based) application layer and the Solr is internal behind it, so it does provide the security. Hope this helps to understand the distinction. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Apr 26, 2013 at 12:57 PM, Jack Krupansky j...@basetechnology.com wrote: Generally, your UI web pages should communicate with your own application layer, which in turn communicates with Solr, but you should try to avoid having Solr itself visible to the outside world. -- Jack Krupansky -Original Message- From: kneerosh Sent: Friday, April 26, 2013 12:46 PM To: solr-user@lucene.apache.org Subject: Customizing Solr GUI Hi, I want to customize Solr gui, and I learnt that the most popular options are 1. Velocity- which is integrated with Solr. The format and options can be customized 2. Project Blacklight Pros and cons? Secondly I read that one can delete data by just running a delete query in the URL. Does either velocity or blacklight provide a way to disable this, or provide any kind of security or access control- so that users can only browse/search and admins can view the admin screen. How can we handle the security aspect in Solr? -- View this message in context: http://lucene.472066.n3.nabble.com/Customizing-Solr-GUI-tp4059257.html Sent from the Solr - User mailing list archive at Nabble.com.
relevance when merging results
Hi, I'm currently using Solr 4.0 final on Tomcat v7.0.3x. I have 2 cores (let's call them A and B) and I need to combine them as one for the UI. However, we're having trouble deciding how best to merge these two result sets. Currently, I'm using relevancy to do the merge. For example, I search for "red" in both cores. Core A has a max score of .919856 with 87 results. Core B has a max score of .6532563 with 30 results. I would like to simply merge numerically, but I don't know if that's valid. If I merge in numerical order, then Core B results won't appear until element 25 or later. I initially thought about just taking the top 5 results from each and layering one on top of the other. Is there a best practice out there for merging relevancy? Please advise... Thanks, -- View this message in context: http://lucene.472066.n3.nabble.com/relevance-when-merging-results-tp4059275.html Sent from the Solr - User mailing list archive at Nabble.com.
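One common heuristic (not a Solr feature, just an application-side sketch) is to normalize each core's scores by that core's own max score before interleaving. Raw Lucene scores from different cores and queries are not directly comparable, so treat this as a rough ordering rather than true cross-core relevance; the class and field names below are illustrative:

```java
import java.util.*;

public class MergeResults {
    // Minimal result holder; in SolrJ the score would come from the
    // "score" pseudo-field of each SolrDocument.
    static class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Divide every score in the list by that list's maximum score,
    // so the top hit of each core ends up at 1.0.
    static List<Hit> normalize(List<Hit> hits) {
        double max = 0.0;
        for (Hit h : hits) max = Math.max(max, h.score);
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) out.add(new Hit(h.id, max > 0 ? h.score / max : 0.0));
        return out;
    }

    // Normalize both lists, concatenate, and sort by normalized score descending.
    static List<Hit> merge(List<Hit> coreA, List<Hit> coreB) {
        List<Hit> merged = new ArrayList<>(normalize(coreA));
        merged.addAll(normalize(coreB));
        merged.sort((x, y) -> Double.compare(y.score, x.score));
        return merged;
    }

    public static void main(String[] args) {
        // Max scores taken from the example in the thread
        List<Hit> a = Arrays.asList(new Hit("a1", 0.919856), new Hit("a2", 0.45));
        List<Hit> b = Arrays.asList(new Hit("b1", 0.6532563), new Hit("b2", 0.30));
        for (Hit h : merge(a, b)) System.out.println(h.id + " " + h.score);
    }
}
```

With this scheme Core B's top hit surfaces immediately alongside Core A's, instead of appearing only after two dozen Core A results.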
Re: Exclude Pattern at Dynamic Field
No, other than being explicit about individual patterns, which is better anyway. Generally, * is a crutch or experimental tool ("Let's just see what all the data and metadata is and then decide what to keep"). It is better to use explicit patterns or a static schema for production use. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 11:29 AM To: solr-user@lucene.apache.org Subject: Exclude Pattern at Dynamic Field I use this in my Solr 4.2.1: <dynamicField name="*" type="ignored" multiValued="true"/> However, can I exclude some patterns from it?
Re: Prons an Cons of Startup Lazy a Handler?
: In short, whether you want to keep the handler is completely independent of : the lazy startup option. I think Jack misread your question -- my interpretation is that you are asking about the pros/cons of removing 'startup=lazy' ... : <requestHandler name="/update/extract" : startup="lazy" class="solr.extraction.ExtractingRequestHandler"> : : it starts up lazily. So what are the pros and cons of removing it in my : situation? ...if you know you will definitely be using this handler, then you should probably remove startup="lazy" -- the advantage of lazy request handlers is that there is no init cost for having them in your config if you never use them, making them handy for the example configs that many people copy and re-use without modifying, so that they don't pay any price for having features declared that they don't use. -Hoss
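Concretely, the eager version is just the same handler declaration with the startup attribute dropped (class name as quoted in the question above):

```xml
<!-- initialized at core load instead of on first request -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"/>
```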
Re: Weird query issues
Hello Shawn, We found that it is unrelated to the group queries and instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SolrJ? I can generate a similar query via curl hitting the select handler, like http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerThread-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things: 1. Even if I query from the admin console, irrespective of the query, I get the same results as if it's a cached page of a *:* query, i.e. I cannot see the query I entered in the server log; the query doesn't even reach the server, but I get the same results as *:*. 2. If I query via SolrJ, no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Thanks Ravi Kiran Bhaskar On Sat, Apr 20, 2013 at 10:33 PM, Ravi Solr ravis...@gmail.com wrote: Thanks for your advice Shawn. I have created a JIRA issue, SOLR-4743. On Sat, Apr 20, 2013 at 4:32 PM, Shawn Heisey s...@elyograg.org wrote: On 4/20/2013 9:08 AM, Ravi Solr wrote: Thank you very much for responding Shawn. I never use IE, I use Firefox. These are brand new servers and I don't think I am mixing versions. What made you think I was using 1.4.1? You are correct in saying that the server is throwing an HTML response, since a group query has been failing with a SEVERE error, following which the entire instance behaves weirdly until we restart. It's surprising that group query error handling has such a glaring issue.
If you specify group=true but don't specify group.query or group.field, Solr throws a SEVERE exception, following which we see the empty queries and finally no responses via SolrJ, and the admin console gives numFound always equal to the total number of docs in the index. It looks like the searcher goes for a spin once it encounters the exception. Such a situation should have been gracefully handled. Ah, so what's happening is that after an invalid grouping query, Solr is unstable and stops working right. You should file an issue in Jira, giving as much detail as you can. My last message was almost completely wrong. You are right that it should be gracefully handled, and obviously it is not. For the 3.x Solr versions, grouping did not exist before 3.6. It is a major 4.x feature that was backported. Sometimes such major features depend on significant changes that have not happened on older versions, leading to problems like this. Unfortunately, you could wait quite a while for a fix on 3.6, where active development has stopped. I have no personal experience with grouping, but I just tried the problematic query (adding group=true to one that works) on 4.2.1. It doesn't throw an error, I just get no results. When I follow it with a regular query, everything works perfectly. Would you be able to upgrade to 4.2.1? That's not a trivial thing to do, so hopefully you are already working on upgrading. Thanks, Shawn
Re: Need to log query request before it is processed
Solved this using a custom SearchHandler and some Log4J goodness. Posting here in case anyone has need for logging query requests before they are executed, which in my case is useful for tracking any queries that cause OOMs. My solution uses Log4J's NDC support to log each query request before it is processed ... the trick was that the SolrCore.execute method logs at the very end, so I wasn't able to push and pop the NDC from first- and last- SearchComponents respectively. In other words, SolrCore logs the query after all the search components complete, so I couldn't pop the NDC stack in a last-component. Consequently, I created a simple extension to SearchHandler that relies on the SolrRequestInfo close hook to pop the NDC:

public class NDCLoggingSearchHandler extends SearchHandler implements Closeable {
    private static final Logger log = Logger.getLogger(NDCLoggingSearchHandler.class);
    private static final AtomicInteger ndc = new AtomicInteger(0);

    public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
        SolrRequestInfo.getRequestInfo().addCloseHook(this);
        NDC.push("Q:" + ndc.incrementAndGet());
        log.info(req.getParamString());
        super.handleRequest(req, rsp);
    }

    public void close() throws IOException {
        NDC.remove();
    }
}

Now I get nice logging like:

2013-04-26 19:07:52,545 [qtp1480462011-13] INFO analytics.solr.NDCLoggingSearchHandler Q:20 - indent=true&q=*:*&wt=xml
2013-04-26 19:07:52,717 [qtp1480462011-13] INFO solr.core.SolrCore Q:20 - [solr_signal] webapp=/solr path=/select params={indent=true&q=*:*&wt=xml} hits=25389931 status=0 QTime=172

The Q:20 part is the NDC. Cheers, Tim PS - I am so happy that Mark switched things to Log4J for 4.3 - https://issues.apache.org/jira/browse/SOLR-3706 +1x10 On Thu, Apr 25, 2013 at 5:44 PM, Sudhakar Maddineni maddineni...@gmail.com wrote: HI Tim, Have you tried enabling the logging levels on httpclient, which is used by the solrj classes internally? Thx, Sudhakar. On Thu, Apr 25, 2013 at 10:12 AM, Timothy Potter thelabd...@gmail.com wrote: I would like to log query requests before they are processed. Currently, it seems they are only logged after being processed. I've tried enabling a finer logging level but that didn't seem to help. I've enabled request logging in Jetty but most queries come in as POSTs from SolrJ. I was thinking of adding a query request logger as a first-component but wanted to see what others have done for this? Thanks. Tim
On Thu, Apr 25, 2013 at 10:12 AM, Timothy Potter thelabd...@gmail.comwrote: I would like to log query requests before they are processed. Currently, it seems they are only logged after being processed. I've tried enabling a finer logging level but that didn't seem to help. I've enabled request logging in Jetty but most queries come in as POSTs from SolrJ I was thinking of adding a query request logger as a first-component but wanted to see what others have done for this? Thanks. Tim
Re: Need to log query request before it is processed
I see. Thanks for sharing. -Sudhakar. On Friday, April 26, 2013, Timothy Potter wrote: Solved this using a custom SearchHandler and some Log4J goodness. ...
Not In query
Hi all. We have an index with 300.000 documents and a lot, a lot of fields. We're planning a module where users will choose some documents to exclude from their search results. So, these documents will be excluded for UserA and visible for UserB. We have some options to do this. The simplest way is to do a Not In query on document id, but we don't know the performance impact this will have. Is this an option? Is there another reasonable way to accomplish this? Thanks * -- * *And you shall know the truth, and the truth shall set you free. (John 8:32)* *andre.maldonado*@gmail.com (11) 9112-4227 http://www.orkut.com.br/Main#Profile?uid=2397703412199036664 http://www.facebook.com/profile.php?id=10659376883 http://twitter.com/andremaldonado http://www.delicious.com/andre.maldonado https://profiles.google.com/105605760943701739931 http://www.linkedin.com/pub/andr%C3%A9-maldonado/23/234/4b3 http://www.youtube.com/andremaldonado
facet.offset issue (previously: [solr 3.4] anomaly during distributed facet query with 102 shards)
Hi list, We have encountered a weird bug related to the facet.offset parameter. In short: the more general the query is, generating lots of hits, the higher the risk of the facet.offset parameter to stop working. In more detail: 1. Since getting all the facets we need (facet.limit=1000) from around 100 shards didn't work for some broad query terms, like "the" (yes, we index and search those too), we decided to paginate. 2. The facet page size is set to 100 for all pages starting from the second one. We start with facet.offset=0&facet.limit=30, then continue with facet.offset=30&facet.limit=100, then facet.offset=100&facet.limit=100 and so on, until we get to facet.offset=900. All facets work just fine until we hit facet.offset=700. Debugging showed that in the class HttpCommComponent a static Executor instance is created with a setting to terminate idle threads after 5 sec. Our belief is that this setting is way too low for our billion-document scenario and broad searches. Setting this to 5 min seems to improve the situation a bit, but not solve it fully. This same class is no longer used in 4.2.1 (can anyone tell what's used instead in distributed faceting?) so it isn't easy to compare these parts of the code. Anyhow, playing now with this value in the hope of seeing some light in the tunnel (would be good if it is not the train). One more question: can this be related to RAM allocation on the router and/or shards? If RAM isn't enough for some operations, why wouldn't the router or shards just crash with OOM? If anyone has other ideas for what to try / look into, that'll be much appreciated. Dmitry
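For reference, the pagination scheme described above (a first page of 30 facets, then pages of 100 up to offset 900) can be generated by a small client-side helper. This is purely illustrative plain Java, mirroring the sequence in the message, not any Solr API:

```java
import java.util.ArrayList;
import java.util.List;

public class FacetPages {
    // The exact sequence from the message: offset 0 / limit 30, then
    // offset 30 / limit 100, then offsets 100..900 in steps of 100.
    static List<int[]> pages() {
        List<int[]> pages = new ArrayList<>();
        pages.add(new int[] {0, 30});
        pages.add(new int[] {30, 100});
        for (int offset = 100; offset <= 900; offset += 100) {
            pages.add(new int[] {offset, 100});
        }
        return pages;
    }

    public static void main(String[] args) {
        for (int[] p : pages()) {
            System.out.println("facet.offset=" + p[0] + "&facet.limit=" + p[1]);
        }
    }
}
```

Each pair would be appended to the facet request as facet.offset / facet.limit parameters.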
Re: Weird query issues
On 4/26/2013 1:01 PM, Ravi Solr wrote: Hello Shawn, We found that it is unrelated to the group queries instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SOLRJ ? I can generate similar query via curl hitting the select handler like - http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerTh read-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things 1. even if I query from admin console, irrespective of the query, I get same results as if its a cached page of *:* query. i.e. I cannot see the query I entered in the server log, the query doesn't even come to the server but I get same results as *:* 2. If I query via solrj no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Querying from the admin UI and not seeing anything in the server log sounds like browser caching. You can turn that off in solrconfig.xml. I could not duplicate what you're seeing with SolrJ. You didn't say what version of SolrJ, so I did this using 3.6.2 (same as your server version). I thought maybe if you had a query object that didn't have an actual query set, it might do what you're seeing, but that doesn't appear to be the case. I don't have a 3.6.2 server to test against, so I used my 3.5.0 and 4.2.1 servers. 
Test code: http://pastie.org/private/bnvurz1f9b9viawgqbxvmq Solr 4.2.1 log: INFO - 2013-04-26 14:17:24.127; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} hits=0 status=0 QTime=20 3.5.0 server log: Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Apr 26, 2013 2:20:23 PM org.apache.solr.core.SolrCore execute INFO: [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} status=500 QTime=0 Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Same code without the setParser line: Solr 4.2.1 log: INFO - 2013-04-26 14:14:01.270; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=javabin&version=2} hits=0 status=0 QTime=187 Thanks, Shawn
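For Shawn's browser-caching suggestion, the relevant solrconfig.xml knob is the httpCaching element inside requestDispatcher. A minimal fragment (a sketch, to be merged into your existing requestDispatcher section) looks like:

```
<requestDispatcher>
  <!-- never304="true" makes Solr stop sending cache validators
       (ETag / Last-Modified), so the browser stops serving a
       stale cached *:* response for admin-UI queries -->
  <httpCaching never304="true"/>
</requestDispatcher>
```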
Re: How to define a generic field to hold all undefined fields
I can highly recommend reading the documentation before asking questions :) You are using the ExtractingRequestHandler, which is documented on the wiki like most other stuff. The fastest way to search for Solr stuff is search-lucene.com: http://search-lucene.com/?q=extracting+request+handler&fc_project=Solr Reading that wiki page you'll notice the parameters uprefix and defaultField, which would both be ways to solve your problem. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 26 Apr 2013 at 15:16, Furkan KAMACI furkankam...@gmail.com wrote: I sent some documents to my Solr to be indexed. However I get these kinds of errors: ERROR: [doc=0579B002] unknown field 'name' I know that I should define a field named 'name' in my schema. However there may be many fields like that. How can I define a generic field that holds all undefined values, or maybe how can I ignore them?
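To make the uprefix advice concrete, a sketch of the two pieces involved: the extract handler in solrconfig.xml prefixes any field not in the schema, and a catch-all dynamicField in schema.xml absorbs the prefixed names. The field names below follow the stock Solr example configs; verify against your own schema:

```
<!-- solrconfig.xml: unknown extracted fields get the ignored_ prefix -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<!-- schema.xml: silently swallow anything matching the prefix -->
<dynamicField name="ignored_*" type="ignored" indexed="false" stored="false"/>
```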
IOException when using Solr 4.2.1 for indexing
Hi All, I get the error below on trying to index using Solr 4.2.1. I have a single-core setup and use HttpSolrServer with DefaultHttpClient to talk to Solr. #Here is how HttpSolrServer is instantiated: solrServer = new HttpSolrServer( baseURL, configurator.createHttpClient( new BasicHttpParams( ) ) ); #DefaultHttpClient creation: public DefaultHttpClient createHttpClient( HttpParams parameters ) { DefaultHttpClient httpClient = new DefaultHttpClient( connectionManager, parameters ); httpClient.setRoutePlanner( routePlanner ); return httpClient; } Any ideas on what it is that I am doing incorrectly will be appreciated. Thanks! Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://example.com:8080/solr-server-4.2.1 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:416) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] at com.qpidhealth.qpid.solr.SolrService.saveOrUpdate(SolrService.java:117) ~[classes/:na] ... 
84 common frames omitted Caused by: org.apache.http.client.ClientProtocolException: null at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353) ~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:28:42] ... 89 common frames omitted Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity. The cause lists the reason the original request fail$ at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:517) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) ~[httpclient-4.2.2.jar:4.2.2] ... 
92 common frames omitted Caused by: java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) ~[na:1.7.0_05] at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109) ~[na:1.7.0_05] at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[na:1.7.0_05] at org.apache.http.impl.io.AbstractSessionOutputBuffer.flushBuffer(AbstractSessionOutputBuffer.java:147) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:167) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197) ~[httpclient-4.2.2.jar:4.2.2] at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257) ~[httpcore-4.2.2.jar:4.2.2] at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.2.2.jar:4.2.2]
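Reading the trace bottom-up: the root cause is a Broken pipe while streaming the request body; HttpClient then tried to retry, but the InputStreamEntity wrapping the update payload can only be read once, hence the NonRepeatableRequestException. One hedged workaround, independent of any Solr or HttpClient API, is to buffer the payload into memory before building the request, since an in-memory body can be re-sent on retry. A minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class RepeatableBody {
    // Drain a one-shot stream into memory; a byte[] body can safely be
    // re-sent if HttpClient retries after a dropped connection.
    static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = toByteArray(new ByteArrayInputStream("<add>...</add>".getBytes()));
        System.out.println(body.length);
    }
}
```

The trade-off is memory: this is only reasonable if individual update batches are small enough to hold in RAM.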
Re: Not In query
I would start with the way you propose, a negative filter: q=foo bar&fq=-id:(123 729 640 112...) This will effectively hide those doc ids, and a benefit is that it is cached, so if the list of ids is long you'll only take the performance hit the first time. I don't know your application, but if it is highly likely that a single user will add excludes for several thousand ids, then you should perhaps consider other options and benchmark up front. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 26 Apr 2013 at 21:50, André Maldonado andre.maldon...@gmail.com wrote: Hi all. We have an index with 300.000 documents and a lot, a lot of fields. We're planning a module where users will choose some documents to exclude from their search results. So, these documents will be excluded for UserA and visible for UserB. We have some options to do this. The simplest way is to do a Not In query on document id, but we don't know the performance impact this will have. Is this an option? Is there another reasonable way to accomplish this? Thanks * -- * *And you shall know the truth, and the truth shall set you free. (John 8:32)* *andre.maldonado*@gmail.com (11) 9112-4227 http://www.orkut.com.br/Main#Profile?uid=2397703412199036664 http://www.facebook.com/profile.php?id=10659376883 http://twitter.com/andremaldonado http://www.delicious.com/andre.maldonado https://profiles.google.com/105605760943701739931 http://www.linkedin.com/pub/andr%C3%A9-maldonado/23/234/4b3 http://www.youtube.com/andremaldonado
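The negative filter Jan describes can be assembled client-side. A small helper, purely illustrative (the field name "id" and the helper itself are assumptions, not any Solr API), that turns a list of excluded ids into the fq value:

```java
import java.util.List;

public class ExcludeFilter {
    // Build a negative filter query like "-id:(123 729 640)"; as an fq,
    // Solr caches the resulting doc set after the first request.
    static String excludeFq(String field, List<String> ids) {
        return "-" + field + ":(" + String.join(" ", ids) + ")";
    }

    public static void main(String[] args) {
        System.out.println(excludeFq("id", List.of("123", "729", "640")));
        // prints: -id:(123 729 640)
    }
}
```

In SolrJ this string would typically be passed via addFilterQuery on the query object, keeping the main q untouched.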
Re: How to get/set customized Solr data source properties?
: : I am working on a DataSource implementation. I want to get some customized : properties when the *DataSource.init* method is called. I tried to add the ... : dataConfig : dataSource type="com.my.company.datasource" : my="value" / My understanding from looking at other DataSources is that this should work. : But initProps.getProperty("my") == null. Can you show us the actual code that fails with the dataConfig you mentioned? -Hoss
Re: Solr index searcher to lucene index searcher
: used to call the lucene IndexSearcher . As the documents are collected in : TopDocs in Lucene , before that is passed back to Nutch , i used to look : into the top K matching documents , consult some external repository : and further score the Top K documents and reorder them in the TopDocs array : . These reordered TopDocs is passed to Nutch . All these reordering code : was implemented by Extending the lucene IndexSearcher class . 1) that's basically the same info you provided before -- it still doesn't really tell us anything about what your current logic does with the top K documents and how/why/when you decide to reorder them or by how much -- details that are kind of important in being able to provide you with any meaningful advice on how to achieve your goal using the plugin hooks available in Solr. 2) if you only care about re-ordering the Top K documents using some secret sauce, then i would suggest you just set rows=K and let Solr do its thing, then post-process the results -- either in your client, or in a SearchComponent that modifies the SolrDocumentList produced by QueryComponent. : can you elaborate on what exactly your some logic involves? ... : https://people.apache.org/~hossman/#xyproblem : XY Problem : : Your question appears to be an XY Problem ... that is: you are dealing : with X, you are assuming Y will help you, and you are asking about Y : without giving more details about the X so that we can understand the : full issue. Perhaps the best solution doesn't involve Y at all? : See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
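The fetch-rows=K-then-post-process route reduces, on the client side, to a plain re-sort of the top K results by a blended score. A minimal sketch; the record, the field names, and the 50/50 weighting are entirely made up for illustration and come from no Solr or Nutch API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class Rerank {
    // Hypothetical per-document scores: Lucene's and an external one.
    record Doc(String id, double luceneScore, double externalScore) {}

    // Re-order the top K docs by a blended score, highest first.
    static List<Doc> rerank(List<Doc> topK, double wLucene, double wExternal) {
        List<Doc> out = new ArrayList<>(topK);
        out.sort(Comparator.comparingDouble(
                (Doc d) -> wLucene * d.luceneScore() + wExternal * d.externalScore())
            .reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("a", 0.9, 0.1),
            new Doc("b", 0.5, 0.9));
        // "b" wins on the 50/50 blend (0.7 vs 0.5)
        System.out.println(rerank(docs, 0.5, 0.5).get(0).id());
    }
}
```

Inside a SearchComponent the same sort would be applied to the SolrDocumentList instead of a local record list.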
Re: Weird query issues
Thanks Shawn, We are using the 3.6.2 client and server. I cleared my browser cache several times while querying (is that similar to clearing the cache in solrconfig.xml ?). The query is logged in the solrj-based client's application container, however I see it empty in the solr application container... so somehow it is getting swallowed by solr... I am not able to figure out how and why? Thanks Ravi Kiran Bhaskar On Fri, Apr 26, 2013 at 4:33 PM, Shawn Heisey s...@elyograg.org wrote: On 4/26/2013 1:01 PM, Ravi Solr wrote: Hello Shawn, We found that it is unrelated to the group queries instead more related to the empty queries. Do you happen to know what could cause empty queries like the following from SOLRJ ? I can generate similar query via curl hitting the select handler like - http://server:port/solr/select server.log_2013-04-26T05-02-22:[#|2013-04-26T04:33:39.065-0400|INFO|sun-appserver2.1.1|org.apache.solr.core.SolrCore|_ThreadID=38;_ThreadName=httpSSLWorkerTh read-9001-11;|[xxxcore] webapp=/solr path=/select params={} hits=24099 status=0 QTime=19 |#] What we are seeing is a huge number of these empty queries. Once this happens I have observed 2 things 1. even if I query from admin console, irrespective of the query, I get same results as if its a cached page of *:* query. i.e. I cannot see the query I entered in the server log, the query doesn't even come to the server but I get same results as *:* 2. If I query via solrj no results are returned. This has been driving me nuts for almost a week. Any help is greatly appreciated. Querying from the admin UI and not seeing anything in the server log sounds like browser caching. You can turn that off in solrconfig.xml. I could not duplicate what you're seeing with SolrJ. You didn't say what version of SolrJ, so I did this using 3.6.2 (same as your server version). I thought maybe if you had a query object that didn't have an actual query set, it might do what you're seeing, but that doesn't appear to be the case. 
I don't have a 3.6.2 server to test against, so I used my 3.5.0 and 4.2.1 servers. Test code: http://pastie.org/private/bnvurz1f9b9viawgqbxvmq Solr 4.2.1 log: INFO - 2013-04-26 14:17:24.127; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} hits=0 status=0 QTime=20 3.5.0 server log: Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Apr 26, 2013 2:20:23 PM org.apache.solr.core.SolrCore execute INFO: [ncmain] webapp=/solr path=/select params={wt=xml&version=2.2} status=500 QTime=0 Apr 26, 2013 2:20:23 PM org.apache.solr.common.SolrException log SEVERE: java.lang.NullPointerException Same code without the setParser line: Solr 4.2.1 log: INFO - 2013-04-26 14:14:01.270; org.apache.solr.core.SolrCore; [ncmain] webapp=/solr path=/select params={wt=javabin&version=2} hits=0 status=0 QTime=187 Thanks, Shawn
Re: Solr index searcher to lucene index searcher
Hi, Thanks Chris. For every document that matches the query I want to be able to compute the following set of features for a query-document pair: LuceneScore (the vector space score that lucene gives to each doc) LinkScore (computed from nutch) OpicScore (computed from nutch) co-rd in title,content,anchor,url wt of Entity in title,content,anchor,url length of title,content,anchor,url sum-of-tf in title,content,anchor,url sum-of-norm-tf in title,content,anchor,url min-of-tf in title,content,anchor,url max-of-tf in title,content,anchor,url variance-of-tf in title,content,anchor,url sum-of-tf-idf in title,content,anchor,url site-reputation-score entity-support-score domain score url-click-count query-url-click-count num-of-slashes-in-url Based on these features I want to build a machine-learned model that will learn to rank/score the documents. I am trying to understand how to compute the features efficiently on the fly. Looking into the index and computing these features seems to be very slow, so for the time being I wanted to implement this by looking into the top K documents. A few of these features have to be computed on the fly, and some of them are computed while indexing and stored in the index. I need to be able to look at all features to score/rank the final set of documents. Thanks, Pom.. On Sat, Apr 27, 2013 at 5:43 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : used to call the lucene IndexSearcher . As the documents are collected in : TopDocs in Lucene , before that is passed back to Nutch , i used to look : into the top K matching documents , consult some external repository : and further score the Top K documents and reorder them in the TopDocs array : . These reordered TopDocs is passed to Nutch . All these reordering code : was implemented by Extending the lucene IndexSearcher class . 
1) that's basically the same info you provided before -- it still doesn't really tell us anything about what your current logic does with the top K documents and how/why/when you decide to reorder them or by how much -- details that are kind of important in being able to provide you with any meaningful advice on how to achieve your goal using the plugin hooks available in Solr. 2) if you only care about re-ordering the Top K documents using some secret sauce, then i would suggest you just set rows=K and let Solr do its thing, then post-process the results -- either in your client, or in a SearchComponent that modifies the SolrDocumentList produced by QueryComponent. : can you elaborate on what exactly your some logic involves? ... : https://people.apache.org/~hossman/#xyproblem : XY Problem : : Your question appears to be an XY Problem ... that is: you are dealing : with X, you are assuming Y will help you, and you are asking about Y : without giving more details about the X so that we can understand the : full issue. Perhaps the best solution doesn't involve Y at all? : See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
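Several of the listed features (length of url, num-of-slashes-in-url, and the like) are cheap string computations that can be done once at index time rather than on the fly. A toy extractor; the feature names are assumptions for illustration and are not taken from Nutch or Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlFeatures {
    // Compute a few of the cheap per-URL features mentioned above.
    static Map<String, Integer> features(String url) {
        Map<String, Integer> f = new LinkedHashMap<>();
        f.put("url-length", url.length());
        f.put("num-of-slashes-in-url",
              (int) url.chars().filter(c -> c == '/').count());
        f.put("num-of-dots-in-url",
              (int) url.chars().filter(c -> c == '.').count());
        return f;
    }

    public static void main(String[] args) {
        System.out.println(features("http://example.com/a/b.html"));
    }
}
```

Features like these would typically be stored as indexed fields at crawl time, leaving only the query-dependent ones (tf sums, click counts per query) to be computed at search time.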