Re: Binary content index with multiple cores

2012-07-26 Thread davidbougearel
OK, I found a way to use it; it was a problem with libraries.

In fact I don't want to index PDF or Word directly, I just want to get the
content to add into my document content, so I guess I will have to use Tika
to get the XML and extract the node that I want.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Binary-content-index-with-multiple-cores-tp3997221p3997370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Binary content index with multiple cores

2012-07-26 Thread davidbougearel
To help find the solution, here is the stack trace from my JUnit test:

org.apache.solr.client.solrj.SolrServerException: Server at
http://localhost:8983/solr/document returned non ok status:500,
message:Internal Server Error
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)


And the console error from Apache Tomcat:

[WARNING] [talledLocalContainer] Jul 26, 2012 7:32:20 AM
org.apache.solr.common.SolrException log
[WARNING] [talledLocalContainer] SEVERE:
org.apache.solr.common.SolrException: lazy loading error
[WARNING] [talledLocalContainer]at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:260)
[WARNING] [talledLocalContainer]at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
[WARNING] [talledLocalContainer]at
org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
[WARNING] [talledLocalContainer]at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
[WARNING] [talledLocalContainer]at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
[WARNING] [talledLocalContainer]at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
[WARNING] [talledLocalContainer]at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:615)
[WARNING] [talledLocalContainer]at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
[WARNING] [talledLocalContainer]at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
[WARNING] [talledLocalContainer]at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
[WARNING] [talledLocalContainer]at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
[WARNING] [talledLocalContainer]at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
[WARNING] [talledLocalContainer]at java.lang.Thread.run(Thread.java:722)

[WARNING] [talledLocalContainer] Caused by:
*org.apache.solr.common.SolrException: Error Instantiating Request Handler,
solr.extraction.ExtractingRequestHandler is not a
org.apache.solr.request.SolrRequestHandler*
[WARNING] [talledLocalContainer]at
org.apache.solr.core.SolrCore.createInstance(SolrCore.java:421)
[WARNING] [talledLocalContainer]at
org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:455)
[WARNING] [talledLocalContainer]at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:251)
[WARNING] [talledLocalContainer]... 17 more
[WARNING] [talledLocalContainer] 

I hope this helps you find what's wrong.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Binary-content-index-with-multiple-cores-tp3997221p3997368.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: numFound inconsistent for different rows-param

2012-07-26 Thread patrick
I resolved my confusion and discovered that the documents of the second
shard contained the same 'unique' id.


rows=0 displayed the 'correct' numFound since (as I understand it) there
was no merge of the results.


cheerio,
patrick

On 25.07.2012 17:07, patrick wrote:

hi,

I'm running two Solr v3.6 instances:

rdta01:9983/solr/msg-core  : 8 documents
rdta01:28983/solr/msg-core : 4 documents

the following two queries, with rows=10 and rows=0 respectively, return different
numFound results, which confuses me. I hope someone can clarify this
behaviour.

URL with rows=10:
-
http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=10

numFound=8 (incorrect, second shard is missing)

URL with rows=0:

http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=0

numFound=12 (correct)

cheerio,
patrick




Re: Binary content index with multiple cores

2012-07-26 Thread davidbougearel
Thanks for replying; here is my dependency tree related to solr-cell:

 org.apache.solr:solr-cell:jar:3.6.0:compile
[INFO] |  +- com.ibm.icu:icu4j:jar:4.8.1.1:compile
[INFO] |  +- *org.apache.tika:tika-parsers:jar:1.0:compile*
[INFO] |  |  +- org.apache.tika:tika-core:jar:1.0:compile
[INFO] |  |  +- edu.ucar:netcdf:jar:4.2-min:compile
[INFO] |  |  +- org.apache.james:apache-mime4j-core:jar:0.7:compile
[INFO] |  |  +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
[INFO] |  |  +- org.apache.commons:commons-compress:jar:1.3:compile
[INFO] |  |  +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
[INFO] |  |  |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
[INFO] |  |  |  \- org.apache.pdfbox:jempbox:jar:1.6.0:compile
[INFO] |  |  +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
[INFO] |  |  +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
[INFO] |  |  +- org.apache.poi:poi:jar:3.8-beta4:compile
[INFO] |  |  +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
[INFO] |  |  +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
[INFO] |  |  |  \- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
[INFO] |  |  | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
[INFO] |  |  +-
org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
[INFO] |  |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] |  |  +- asm:asm:jar:3.1:compile
[INFO] |  |  +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
[INFO] |  |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] |  |  \- rome:rome:jar:0.9:compile
[INFO] |  | \- jdom:jdom:jar:1.0:compile
[INFO] |  \- xerces:xercesImpl:jar:2.8.1:compile
[INFO] | \- xml-apis:xml-apis:jar:1.3.03:compile

As you can see, I have the tika-parsers.

About the solr.war: when I start my mvn cargo:run, the pom.xml is set up so that
it creates the solr.war, and for solr-cell Tomcat needs some
dependencies like solr-cell, solr-core, solr-solrj, tika-core and slf4j-api.

Do you have any idea where my mistake is?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Binary-content-index-with-multiple-cores-tp3997221p3997367.html
Sent from the Solr - User mailing list archive at Nabble.com.


Bulk indexing data into solr

2012-07-26 Thread Zhang, Lisheng

Hi,

I am starting to use Solr, and now I need to index a rather large amount of data.
It seems that calling Solr to pass the data through HTTP is rather inefficient,
so I am thinking of still calling the Lucene API directly for bulk indexing
but using Solr for search. Is this design OK?

Thanks very much for helps, Lisheng



Re: solr spellchecker hogging all of my memory

2012-07-26 Thread Michael Della Bitta
Do the spellcheck objects eventually get collected off the heap? Maybe
you should dump the heap later and ensure those objects get collected,
in which case, I'd call this a normal heap expansion due to a
temporary usage spike.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Jul 25, 2012 at 10:03 PM, dboychuck dboych...@build.com wrote:
 before I optimize (build my spellchecker index) my solr instance running in
 tomcat uses about 2 gigs of memory
  as soon as I optimize it jumps to about 5 gigs
 http://d.pr/i/oUQI

 it just doesn't seem right

 http://pastebin.com/6Cg7F0dK

 is there anything wrong with my configuration?
 when i dump the heap I can see that spellchecker is using a majority of the
 memory




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/solr-spellchecker-hogging-all-of-my-memory-tp3997353.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Skip first word

2012-07-26 Thread Finotti Simone
Hi Ahmet,
business asked me to apply EdgeNGram with minGramSize=1 on the first term and
with minGramSize=3 on the following terms.

We are developing a search suggestion mechanism; the idea is that if the user
types "D", the engine should suggest "Dolce & Gabbana", but if we type "G", it
should suggest other brands. Only if users type "Gab" should it suggest "Dolce
& Gabbana".

Thanks
S

From: Ahmet Arslan [iori...@yahoo.com]
Sent: Wednesday, 25 July 2012 18:10
To: solr-user@lucene.apache.org
Subject: Re: Skip first word

 is there a tokenizer and/or a combination of filter to
 remove the first term from a field?

 For example:
 The quick brown fox

 should be tokenized as:
 quick
 brown
 fox

There is no such filter that I know of. Though, you could implement one by
modifying the source code of LengthFilterFactory or StopFilterFactory. They both
remove tokens. Out of curiosity, what is the use case for this?






solr host name on solrconfig.xml

2012-07-26 Thread stockii
Hello

I need the host name of my Solr server in my solrconfig.xml.
Does anybody know the correct variable?

Something like ${solr.host} or ${solr.host.name} ...

Is there documentation about ALL the available variables in the Solr
namespaces?

Thanks a lot



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-host-name-on-solrconfig-xml-tp3997371.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Bulk indexing data into solr

2012-07-26 Thread Rafał Kuć
Hello!

If you use Java (and I think you do, because you mention Lucene) you
should take a look at StreamingUpdateSolrServer. It not only allows
you to send data in batches, but also index using multiple threads.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch


 Hi,

 I am starting to use solr, now I need to index a rather large amount of data, 
 it seems
 that calling solr to pass data through HTTP is rather inefficient, I am think 
 still call
 lucene API directly for bulk index but to use solr for search, is this design 
 OK?

 Thanks very much for helps, Lisheng
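
For illustration, a minimal SolrJ sketch of the batching approach described
above (a sketch only; the URL, queue size, thread count and field names are
placeholders, assuming SolrJ 3.6 on the classpath):

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

public class BulkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // Queue up to 1000 docs internally and send them with 4 background threads.
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("name", "document " + i);
            server.add(doc);          // returns quickly; docs are sent in the background
        }

        server.blockUntilFinished(); // wait for the background queue to drain
        server.commit();             // make the batch visible to searchers
    }
}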



Re: Binary content index with multiple cores

2012-07-26 Thread Ahmet Arslan
 About the solr.war, when i start my mvn cargo:run i put into
 the pom.xml the
 fact that he create the sol.war and for solr-cell tomcat
 needs some
 dependencies like solr-cell, solr-core, solr-solrj,
 tika-core and slf4j-api.
 
 Have you any idea about where is my mistake ?

Okay, for solr-cell Tomcat needs dependencies. These dependencies are shipped with
the Solr download (apache-solr-3.6.1.tgz, for example). You don't need to embed
those jars into solr.war; you can pull them in using lib directives.
That said, to enable solr-cell you don't need to re-create solr.war nor use Maven.


Solr - hl.fragsize Issue

2012-07-26 Thread meghana
I am using Solr 3.5, and in my search query I set hl.fragsize=100, but my
fragments do not contain exactly 100 chars; the average fragment size is 120.

Does anybody have an idea about this issue?

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-hl-fragsize-Issue-tp3997457.html
Sent from the Solr - User mailing list archive at Nabble.com.


Expression Sort in Solr

2012-07-26 Thread lavesh
I am working on Solr for search. I need to perform an expression sort such
that

ORDER BY (IF(COUNTRY=1,100,0) + IF(AVAILABLE=2,1000,IF(AVAILABLE=1,60,0)) +
IF (DELIVERYIN IN (5,6,7),100,IF (DELIVERYIN IN (80,90),50,0))) DESC

Can anyone tell me how this is possible?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3997369.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Expression Sort in Solr

2012-07-26 Thread Erik Hatcher
How dynamic are those numbers?   If this expression can be computed at index 
time into a sort_order field, that'd be best.  Otherwise, if these factors 
are truly dynamic at run-time, look at the function query sorting capability 
here: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function and build up 
the expression from there.  I still encourage you to aim towards computing as 
much of this at index-time as possible to minimize the functions (and thus 
caches) you need at query time.

Erik

On Jul 26, 2012, at 03:47 , lavesh wrote:

 am working on solr for search. I required to perform a expression sort such
 that
 
 ORDER BY (IF(COUNTRY=1,100,0) + IF(AVAILABLE=2,1000,IF(AVAILABLE=1,60,0)) +
 IF (DELIVERYIN IN (5,6,7),100,IF (DELIVERYIN IN (80,90),50,0))) DESC
 
 can anyone tell me hows is this possible?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3997369.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr - hl.fragsize Issue

2012-07-26 Thread Ahmet Arslan
 i am using solr 3.5 , and in search
 query i set hl.fragsize = 100 , but my
 fragment does not contain exact 100 chars , average fragment
 size is 120 .
 
 Can anybody have idea about this issue??

Are you using FastVectorHighlighter or DefaultSolrHighlighter?
Could it be that 120 includes the character count of the <em> tags?


Re: leaks in solr

2012-07-26 Thread Karthick Duraisamy Soundararaj
Did you find any more clues? I have this problem on my machines as well.

On Fri, Jun 29, 2012 at 6:04 AM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Hi list,

 while monitoring my solr 3.6.1 installation I recognized an increase of
 memory usage
 in OldGen JVM heap on my slave. I decided to force Full GC from jvisualvm
 and
 send optimize to the already optimized slave index. Normally this helps
 because
 I have monitored this issue over the past. But not this time. The Full GC
 didn't free any memory. So I decided to take a heap dump and see what
 MemoryAnalyzer
 is showing. The heap dump is about 23 GB in size.

 1.)
 Report Top consumers - Biggest Objects:
 Total: 12.3 GB
 org.apache.lucene.search.FieldCacheImpl : 8.1 GB
 class java.lang.ref.Finalizer   : 2.1 GB
 org.apache.solr.util.ConcurrentLRUCache : 1.5 GB
 org.apache.lucene.index.ReadOnlySegmentReader : 622.5 MB
 ...

 As you can see, Finalizer has already reached 2.1 GB!!!

 * java.util.concurrent.ConcurrentHashMap$Segment[16] @ 0x37b056fd0
   * segments java.util.concurrent.ConcurrentHashMap @ 0x39b02d268
 * map org.apache.solr.util.ConcurrentLRUCache @ 0x398f33c30
   * referent java.lang.ref.Finalizer @ 0x37affa810
 * next java.lang.ref.Finalizer @ 0x37affa838
 ...

 Seems to be org.apache.solr.util.ConcurrentLRUCache
 The attributes are:

 Type   |Name  | Value
 -
 boolean| isDestroyed  |  true
 -
 ref| cleanupThread|  null
 
 ref| evictionListener |  null
 ---
 long   | oldestEntry  | 0
 --
 int| acceptableWaterMark |  9500
 --
 ref| stats| org.apache.solr.util.ConcurrentLRUCache$Stats
 @ 0x37b074dc8
 
 boolean| islive   |  true
 -
 boolean| newThreadForCleanup | false
 
 boolean| isCleaning   | false

 
 ref| markAndSweepLock | java.util.concurrent.locks.ReentrantLock @
 0x39bf63978
 -
 int| lowerWaterMark   |  9000
 -
 int| upperWaterMark   | 1
 -
 ref|  map | java.util.concurrent.ConcurrentHashMap @
 0x39b02d268
 --




 2.)
 While searching for open files and their references I noticed that there
 are references to
 index files which are already deleted from disk.
 E.g. recent index files are data/index/_2iqw.frq and
 data/index/_2iqx.frq.
 But I also see references to data/index/_2hid.frq which are quite old
 and are deleted way back
 from earlier replications.
 I have to analyze this a bit deeper.


 So far my report, I go on analyzing this huge heap dump.
 If you need any other info or even the heap dump, let me know.


 Regards
 Bernd




Re: Bulk indexing data into solr

2012-07-26 Thread Shawn Heisey

On 7/26/2012 7:34 AM, Rafał Kuć wrote:

If you use Java (and I think you do, because you mention Lucene) you
should take a look at StreamingUpdateSolrServer. It not only allows
you to send data in batches, but also index using multiple threads.


A caveat to what Rafał said:

The streaming object has no error detection out of the box.  It queues 
everything up internally and returns immediately.  Behind the scenes, it 
uses multiple threads to send documents to Solr, but any errors 
encountered are simply sent to the logging mechanism, then ignored.  
When you use HttpSolrServer, all errors encountered will throw 
exceptions, but you have to wait for completion.  If you need both 
concurrent capability and error detection, you would have to manage 
multiple indexing threads yourself.


Apparently there is a method in the concurrent class that you can 
override and handle errors differently, though I have not seen how to 
write code so your program would know that an error occurred.  I filed 
an issue with a patch to solve this, but some of the developers have 
come up with an idea that might be better.  None of the ideas have been 
committed to the project.


https://issues.apache.org/jira/browse/SOLR-3284

Just an FYI, the streaming class was renamed to 
ConcurrentUpdateSolrServer in Solr 4.0 Alpha.  Both are available in 3.6.x.


Thanks,
Shawn
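
As a hedged sketch of the override Shawn mentions (assuming the
handleError(Throwable) hook present in SolrJ 3.6; verify against your version),
one could at least count failures for the calling code to inspect:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

import java.net.MalformedURLException;
import java.util.concurrent.atomic.AtomicInteger;

public class CountingStreamingServer extends StreamingUpdateSolrServer {
    // Incremented by the background threads whenever a request fails, so the
    // caller can check errorCount.get() after blockUntilFinished(). This still
    // cannot tell you *which* documents failed -- that is what SOLR-3284 is about.
    public final AtomicInteger errorCount = new AtomicInteger();

    public CountingStreamingServer(String url, int queueSize, int threads)
            throws MalformedURLException {
        super(url, queueSize, threads);
    }

    @Override
    public void handleError(Throwable ex) {
        errorCount.incrementAndGet();
        super.handleError(ex);   // keep the default logging behaviour
    }
}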



querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Hi,

I am facing the following issue:

I have a couple of million documents, which have a field called source_id.
My problem is that I want to retrieve all the documents which have a source_id
in a specific range of values. This range can be pretty big, for example
a list of 200 to 2000 source ids.

I was thinking that a filter query could be used, like fq=source_id:(1 2 3 4 5
6 ...), but this reminds me of SQL's WHERE IN (...), which was always a bit slow
for a huge number of values.

Another solution that came to my mind was to assign all the documents I want to
retrieve a new kind of filter id. So all the documents which I want to analyse
get a new id. But I would need to update all the millions of documents for this
and assign them a new id, which could take some time.

Can you think of a nicer way to solve this issue?

Regards & greetings

Daniel


Re: Expression Sort in Solr

2012-07-26 Thread lavesh
Hi

I know we could compute it at index time; however, all the values are dynamic.

A sort expression such as

sum(if(exists(query(COUNTRY:(22 33 44))),100,20),INCOME)

for example

http://devjs.infoedge.com:8080/solr/select?q=*:*&fq=GENDER:FEMALE&sort=sum(if(exists(query(AGE:22)),100,20),INCOME)

IS NOT WORKING
ALSO I NEED NESTED IF

On Thu, Jul 26, 2012 at 8:48 PM, Erik Hatcher-4 [via Lucene] 
ml-node+s472066n3997464...@n3.nabble.com wrote:

 How dynamic are those numbers?   If this expression can be computed at
 index time into a sort_order field, that'd be best.  Otherwise, if these
 factors are truly dynamic at run-time, look at the function query sorting
 capability here: 
 http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function and build up
 the expression from there.  I still encourage you to aim towards computing
 as much of this at index-time as possible to minimize the functions (and
 thus caches) you need at query time.

 Erik

 On Jul 26, 2012, at 03:47 , lavesh wrote:

  am working on solr for search. I required to perform a expression sort
 such
  that
 
  ORDER BY (IF(COUNTRY=1,100,0) +
 IF(AVAILABLE=2,1000,IF(AVAILABLE=1,60,0)) +
  IF (DELIVERYIN IN (5,6,7),100,IF (DELIVERYIN IN (80,90),50,0))) DESC
 
  can anyone tell me hows is this possible?
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3997369.html
  Sent from the Solr - User mailing list archive at Nabble.com.







-- 

Never explain yourself. Your friends don’t need it and
your enemies won’t believe it .




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Expression-Sort-in-Solr-tp3997369p3997475.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

index the id into a field of type tint or tlong and use a range query 
(http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

fq=id:[200 TO 2000]

If you want to exclude certain ids it might be wiser to simply add an exclusion 
query in addition to the range query instead of listing all the single values. 
You will run into problems with too long request urls. If you cannot avoid long 
urls you might want to increase maxBooleanClauses (see 
http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).

Cheers,
Chantal
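
For illustration, the same idea expressed from SolrJ (a sketch only; the field
name and the excluded ids are placeholders):

import org.apache.solr.client.solrj.SolrQuery;

public class RangeFilterExample {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        // One range clause plus a short exclusion list is far shorter than
        // enumerating every wanted id in a WHERE IN style clause.
        q.addFilterQuery("source_id:[200 TO 2000] -source_id:(250 375 1999)");
        q.setRows(10);
        return q;
    }
}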

On 26.07.2012, at 18:01, Daniel Brügge wrote:

 Hi,
 
 i am facing the following issue:
 
 I have couple of million documents, which have a field called source_id.
 My problem is, that I want to retrieve all the documents which have a
 source_id
 in a specific range of values. This range can be pretty big, so for example
 a
 list of 200 to 2000 source ids.
 
 I was thinking that a filter query can be used like fq=source_id:(1 2 3 4 5
 6 .)
 but this reminds me of SQLs WHERE IN (...) which was always bit slow for a
 huge
 number of values.
 
 Another solution that came into my mind was to assigned all the documents I
 want to
 retrieve a new kind of filter id. So all the documents which i want to
 analyse
 get a new id. But i need to update all the millions of documents for this
 and assign
 them a new id. This could take some time.
 
 Do you can think of a nicer way to solve this issue?
 
 Regards  greetings
 
 Daniel



Re: Skip first word

2012-07-26 Thread Chantal Ackermann
Hi,

use two fields:
1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2 for 
inputs of length < 3,
2. the other one tokenized as appropriate with minsize=3 and longer for all 
longer inputs


Cheers,
Chantal


On 26.07.2012, at 09:05, Finotti Simone wrote:

 Hi Ahmet,
 business asked me to apply EdgeNGram with minGramSize=1 on the first term and 
 with minGramSize=3 on the latter terms.
 
 We are developing a search suggestion mechanism, the idea is that if the user 
 types "D", the engine should suggest "Dolce & Gabbana", but if we type "G", 
 it should suggest other brands. Only if users type "Gab" it should suggest 
 "Dolce & Gabbana".
 
 Thanks
 S
 
 From: Ahmet Arslan [iori...@yahoo.com]
 Sent: Wednesday, 25 July 2012 18:10
 To: solr-user@lucene.apache.org
 Subject: Re: Skip first word
 
 is there a tokenizer and/or a combination of filter to
 remove the first term from a field?
 
 For example:
 The quick brown fox
 
 should be tokenized as:
 quick
 brown
 fox
 
 There is no such filter that i know of. Though, you can implement one with 
 modifying source code of LengthFilterFactory or StopFilterFactory. They both 
 remove tokens. Out of curiosity, what is the use case for this?
 
 
 
 



RE: Bulk indexing data into solr

2012-07-26 Thread Zhang, Lisheng
Thanks very much, both your and Rafal's advice are very helpful!

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Thursday, July 26, 2012 8:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


On 7/26/2012 7:34 AM, Rafał Kuć wrote:
 If you use Java (and I think you do, because you mention Lucene) you
 should take a look at StreamingUpdateSolrServer. It not only allows
 you to send data in batches, but also index using multiple threads.

A caveat to what Rafał said:

The streaming object has no error detection out of the box.  It queues 
everything up internally and returns immediately.  Behind the scenes, it 
uses multiple threads to send documents to Solr, but any errors 
encountered are simply sent to the logging mechanism, then ignored.  
When you use HttpSolrServer, all errors encountered will throw 
exceptions, but you have to wait for completion.  If you need both 
concurrent capability and error detection, you would have to manage 
multiple indexing threads yourself.

Apparently there is a method in the concurrent class that you can 
override and handle errors differently, though I have not seen how to 
write code so your program would know that an error occurred.  I filed 
an issue with a patch to solve this, but some of the developers have 
come up with an idea that might be better.  None of the ideas have been 
committed to the project.

https://issues.apache.org/jira/browse/SOLR-3284

Just an FYI, the streaming class was renamed to 
ConcurrentUpdateSolrServer in Solr 4.0 Alpha.  Both are available in 3.6.x.

Thanks,
Shawn



Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
Right in time, guys. https://issues.apache.org/jira/browse/SOLR-3585

Here is a server-side update processing fork. It does its best to halt
processing when an exception occurs. Plug in this UpdateProcessor and specify
the number of threads, then submit a lazy iterator to StreamingUpdateSolrServer
on the client side.

PS: Don't do the following: send many, many docs one-by-one, or instantiate a
huge ArrayList of SolrInputDocuments on the client side.
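
A small client-side sketch of the lazy-iterator idea (assuming the
SolrServer.add(Iterator) overload available in recent SolrJ releases; the
document fields are invented):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.util.Iterator;

public class LazyFeed {
    // Streams documents one at a time instead of materializing a huge list in memory.
    public static void feed(SolrServer server, final int total) throws Exception {
        server.add(new Iterator<SolrInputDocument>() {
            private int i = 0;
            public boolean hasNext() { return i < total; }
            public SolrInputDocument next() {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i++));
                return doc;
            }
            public void remove() { throw new UnsupportedOperationException(); }
        });
        server.commit();
    }
}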

On Thu, Jul 26, 2012 at 7:46 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/26/2012 7:34 AM, Rafał Kuć wrote:

 If you use Java (and I think you do, because you mention Lucene) you
 should take a look at StreamingUpdateSolrServer. It not only allows
 you to send data in batches, but also index using multiple threads.


 A caveat to what Rafał said:

 The streaming object has no error detection out of the box.  It queues
 everything up internally and returns immediately.  Behind the scenes, it
 uses multiple threads to send documents to Solr, but any errors encountered
 are simply sent to the logging mechanism, then ignored.  When you use
 HttpSolrServer, all errors encountered will throw exceptions, but you have
 to wait for completion.  If you need both concurrent capability and error
 detection, you would have to manage multiple indexing threads yourself.

 Apparently there is a method in the concurrent class that you can override
 and handle errors differently, though I have not seen how to write code so
 your program would know that an error occurred.  I filed an issue with a
 patch to solve this, but some of the developers have come up with an idea
 that might be better.  None of the ideas have been committed to the project.

 https://issues.apache.org/jira/browse/SOLR-3284

 Just an FYI, the streaming class was renamed to ConcurrentUpdateSolrServer
 in Solr 4.0 Alpha.  Both are available in 3.6.x.

 Thanks,
 Shawn




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Hey Chantal,

thanks for your answer.

The range queries would not work, because the values are not contiguous;
they can be randomly ordered, with gaps. The above was just an example.

Excluding is also not a solution, because the list of excluded ids would be
even longer.

To be more specific: the IDs are not even integers but UUIDs, and there are
tens of thousands of them. And the document pool contains hundreds of millions
of documents.

Thanks. Daniel



On Thu, Jul 26, 2012 at 6:22 PM, Chantal Ackermann 
c.ackerm...@it-agenten.com wrote:

 Hi Daniel,

 index the id into a field of type tint or tlong and use a range query (
 http://wiki.apache.org/solr/SolrQuerySyntax?highlight=%28rangequery%29):

 fq=id:[200 TO 2000]

 If you want to exclude certain ids it might be wiser to simply add an
 exclusion query in addition to the range query instead of listing all the
 single values. You will run into problems with too long request urls. If
 you cannot avoid long urls you might want to increase maxBooleanClauses
 (see http://wiki.apache.org/solr/SolrConfigXml/#The_Query_Section).

 Cheers,
 Chantal

 On 26.07.2012, at 18:01, Daniel Brügge wrote:

  Hi,
 
  i am facing the following issue:
 
  I have couple of million documents, which have a field called
 source_id.
  My problem is, that I want to retrieve all the documents which have a
  source_id
  in a specific range of values. This range can be pretty big, so for
 example
  a
  list of 200 to 2000 source ids.
 
  I was thinking that a filter query can be used like fq=source_id:(1 2 3
 4 5
  6 .)
  but this reminds me of SQLs WHERE IN (...) which was always bit slow for
 a
  huge
  number of values.
 
  Another solution that came into my mind was to assigned all the
 documents I
  want to
  retrieve a new kind of filter id. So all the documents which i want to
  analyse
  get a new id. But i need to update all the millions of documents for this
  and assign
  them a new id. This could take some time.
 
  Do you can think of a nicer way to solve this issue?
 
  Regards  greetings
 
  Daniel




Is it possible or wise to query multiple cores in parallel in SolrCloud

2012-07-26 Thread Daniel Brügge
Hi,

I am playing around with a SolrCloud setup (4 shards) and thousands of cores.
I am thinking of executing queries on hundreds of cores, like a distributed
query.

Is this possible at all from the SolrCloud side? And is it wise?

Thanks  regards

Daniel


Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
Coming back to your original question, I'm puzzled a little:
it's not clear where you want to call the Lucene API from.
If you mean that you have a standalone indexer which writes the index files, then
stops, and these files become available to the Solr process, it will work.
Sharing an index between processes, or using the EmbeddedServer, is asking for
trouble (despite Lucene having a lock mechanism, which I'm not completely aware
of).
I conclude that your data for indexing is collocated with the Solr
server. In this case consider
http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:


 Hi,

 I am starting to use solr, now I need to index a rather large amount of
 data, it seems
 that calling solr to pass data through HTTP is rather inefficient, I am
 think still call
 lucene API directly for bulk index but to use solr for search, is this
 design OK?

 Thanks very much for helps, Lisheng




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: querying using filter query and lots of possible values

2012-07-26 Thread Alexandre Rafalovitch
You can't update the original documents except by reindexing them, so
no easy group assignment option.

If you create this 'collection' once but query it multiple times, you
may be able to use SOLR4 join with IDs being stored separately and
joined on. Still not great because the performance is an issue when
mapping on IDs:
http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/ .

If the list is some sort of combination of smaller lists - you could
probably precompute (at index time) those fragments and do compound
query over them.

But if you have to query every time and the list is different every
time, that could be complicated.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)
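
A rough, hedged sketch of what such a cross-core join could look like with the
Solr 4 join parser described in the linked post (the helper core name "idlist"
and the field names are invented):

import org.apache.solr.client.solrj.SolrQuery;

public class JoinExample {
    public static SolrQuery build() {
        // Match documents in the main core whose source_id appears as "id" in a
        // small helper core ("idlist") holding only the wanted ids.
        SolrQuery q = new SolrQuery("{!join from=id to=source_id fromIndex=idlist}*:*");
        q.setRows(10);
        return q;
    }
}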


On Thu, Jul 26, 2012 at 12:01 PM, Daniel Brügge
daniel.brue...@googlemail.com wrote:
 Hi,

 i am facing the following issue:

 I have couple of million documents, which have a field called source_id.
 My problem is, that I want to retrieve all the documents which have a
 source_id
 in a specific range of values. This range can be pretty big, so for example
 a
 list of 200 to 2000 source ids.

 I was thinking that a filter query can be used like fq=source_id:(1 2 3 4 5
 6 .)
 but this reminds me of SQLs WHERE IN (...) which was always bit slow for a
 huge
 number of values.

 Another solution that came into my mind was to assigned all the documents I
 want to
 retrieve a new kind of filter id. So all the documents which i want to
 analyse
 get a new id. But i need to update all the millions of documents for this
 and assign
 them a new id. This could take some time.

 Do you can think of a nicer way to solve this issue?

 Regards  greetings

 Daniel


Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Thanks Alexandre,

the list of IDs is constant for a longer time. I will take a look at
the join approach.
Maybe another solution would be to really create a whole new
collection or set of documents, containing the aggregated documents (from the
ids), from scratch and to execute queries on that collection. That would take
some time, but maybe it's worth it because the querying will thank you.

Daniel

On Thu, Jul 26, 2012 at 7:43 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 You can't update the original documents except by reindexing them, so
 no easy group assigment option.

 If you create this 'collection' once but query it multiple times, you
 may be able to use SOLR4 join with IDs being stored separately and
 joined on. Still not great because the performance is an issue when
 mapping on IDs:
 http://www.lucidimagination.com/blog/2012/06/20/solr-and-joins/ .

 If the list is some sort of combination of smaller lists - you could
 probably precompute (at index time) those fragments and do compound
 query over them.

 But if you have to query every time and the list is different every
 time, that could be complicated.

 Regards,
Alex.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Thu, Jul 26, 2012 at 12:01 PM, Daniel Brügge
 daniel.brue...@googlemail.com wrote:
  Hi,
 
  i am facing the following issue:
 
  I have couple of million documents, which have a field called
 source_id.
  My problem is, that I want to retrieve all the documents which have a
  source_id
  in a specific range of values. This range can be pretty big, so for
 example
  a
  list of 200 to 2000 source ids.
 
  I was thinking that a filter query can be used like fq=source_id:(1 2 3
 4 5
  6 .)
  but this reminds me of SQLs WHERE IN (...) which was always bit slow for
 a
  huge
  number of values.
 
  Another solution that came into my mind was to assigned all the
 documents I
  want to
  retrieve a new kind of filter id. So all the documents which i want to
  analyse
  get a new id. But i need to update all the millions of documents for this
  and assign
  them a new id. This could take some time.
 
  Do you can think of a nicer way to solve this issue?
 
  Regards  greetings
 
  Daniel



Re: Skip first word

2012-07-26 Thread in.abdul
That is the best option. I have also used ShingleFilterFactory.
On Jul 26, 2012 10:03 PM, Chantal Ackermann-2 [via Lucene] 
ml-node+s472066n399748...@n3.nabble.com wrote:

 Hi,

 use two fields:
 1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2
 for inputs of length < 3,
 2. the other one tokenized as appropriate with minsize=3 and longer for
 all longer inputs


 Cheers,
 Chantal


 On 26.07.2012, at 09:05, Finotti Simone wrote:

  Hi Ahmet,
  business asked me to apply EdgeNGram with minGramSize=1 on the first
 term and with minGramSize=3 on the latter terms.
 
  We are developing a search suggestion mechanism, the idea is that if the
 user types "D", the engine should suggest "Dolce & Gabbana", but if we type
  "G", it should suggest other brands. Only if users type "Gab" it should
  suggest "Dolce & Gabbana".
 
  Thanks
  S
  
  From: Ahmet Arslan [[hidden email]]

  Sent: Wednesday, 25 July 2012 18:10
  To: [hidden email]
  Subject: Re: Skip first word
 
  is there a tokenizer and/or a combination of filter to
  remove the first term from a field?
 
  For example:
  The quick brown fox
 
  should be tokenized as:
  quick
  brown
  fox
 
  There is no such filter that i know of. Though, you can implement one
 with modifying source code of LengthFilterFactory or StopFilterFactory.
 They both remove tokens. Out of curiosity, what is the use case for this?
 
 
 
 








-
THANKS AND REGARDS,
SYED ABDUL KATHER
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Skip-first-word-tp3997277p3997509.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Bulk indexing data into solr

2012-07-26 Thread Zhang, Lisheng
Hi,

I think that, at least before Lucene 4.0, we can only allow one process/thread
to write to a Lucene folder. Based on this fact my initial plan is:

1) There is one set of Lucene index folders.
2) The Solr server only performs queries on those folders.
3) Have a separate (multi-threaded) process index those Lucene folders (each
   folder is a separate app). Only one thread will index one given Lucene
folder.
Thanks very much for helps, Lisheng
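
For what it's worth, a minimal sketch of step 3 against the plain Lucene 3.6 API
(paths, analyzer and fields are placeholders); the commit makes the new index
generation visible to a separate reader process:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

public class FolderIndexer {
    // One thread indexes one folder; Lucene's write lock keeps a second writer out.
    public static void indexFolder(File folder) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(FSDirectory.open(folder), cfg);
        try {
            Document doc = new Document();
            doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "some text", Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.commit();   // fsyncs the segment files so another process can pick them up
        } finally {
            writer.close();
        }
    }
}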


-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Thursday, July 26, 2012 10:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


Coming back to your original question. I'm puzzled a little.
It's not clear where you wanna call Lucene API directly from.
if you mean that you has standalone indexer, which write index files. Then
it stops and these files become available for Solr Process it will work.
Sharing index between processes, or using EmbeddedServer is looking for
problem (despite Lucene has Locks mechanism, which I'm not completely aware
of).
I can conclude that your data for indexing is collocate with the solr
server. In this case consider
http://wiki.apache.org/solr/ContentStream#RemoteStreaming

Please give more details about your design.

On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:


 Hi,

 I am starting to use solr, now I need to index a rather large amount of
 data, it seems
 that calling solr to pass data through HTTP is rather inefficient, I am
 think still call
 lucene API directly for bulk index but to use solr for search, is this
 design OK?

 Thanks very much for helps, Lisheng




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: querying using filter query and lots of possible values

2012-07-26 Thread Chantal Ackermann
Hi Daniel,

depending on how you decide on the list of ids in the first place, you could
also create a new index (core) and populate it with DIH, which would select only
documents from your main index (core) in this range of ids. When updating you
could try a delta import.

Of course, this is only worth the effort if that core would exist for some time 
- but you've written that the subset of ids is constant for a longer time.

Just another idea on top ;-)
Chantal

Re: separation of indexes to optimize facet queries without fulltext

2012-07-26 Thread Chris Hostetter

: My thought was, that I could separate indexes. So for the facet queries
: where I don't need
: fulltext search (so also no indexed fulltext field) I can use a completely
: new setup of a
: sharded Solr which doesn't include the indexed fulltext, so the index is
: kept small containing
: just the few fields I have.
: 
: And for the fulltext queries I have the current Solr configuration which
: includes as mentioned
: above all the fields incl. the index fulltext field.
: 
: Is this a normal way of handling these requirements. That there are
: different kind of
: Solr configurations for the different needs? Because the huge redundancy

It's definitely doable -- one thing I'm not clear on is why, if your 
faceting queries don't care about the full text, you would need to leave 
those small fields in your full index ... is your plan to do 
faceting and drill down using the smaller index, but then display docs 
resulting from those queries by using the same fq params when querying 
the full index ?  

if so then it should work, if not -- you may not need those fields in that 
index.

In general there is nothing wrong with having multiple indexes to solve 
multiple usecases -- an index is usually an inverted denormalization of 
some structured source data designed for fast queries/retrieval.  If there 
are multiple distinct ways you want to query/retrieve data that don't lend 
themselves to the same denormalization, there's nothing wrong with 
multiple denormalizations.

Something else to consider is an approach i've used many times: having a 
single index, but using special purpose replicas.  You can have a master 
index that you update at the rate of change, one set of slaves that are 
used for one type of query pattern (faceting on X, Y, and Z for example) 
and a different set of slaves that are used for a different query pattern 
(faceting on A, B, and C) so each set of slaves gets a higher cache hit 
rate than if the queries were randomized across all machines

-Hoss


Re: querying using filter query and lots of possible values

2012-07-26 Thread Daniel Brügge
Exactly. Creating a new index from the aggregated documents is the plan
I described above. I don't really know how long this will take for each
new index. Hopefully under 1 hour or so. That would be tolerable.

Thanks. Daniel

On Thu, Jul 26, 2012 at 8:47 PM, Chantal Ackermann 
c.ackerm...@it-agenten.com wrote:

 Hi Daniel,

 depending on how you decide on the list of ids, in the first place, you
 could also create a new index (core) and populate it with DIH which would
 select only documents from your main index (core) in this range of ids.
 When updating you could try a delta import.

 Of course, this is only worth the effort if that core would exist for some
 time - but you've written that the subset of ids is constant for a longer
 time.

 Just another idea on top ;-)
 Chantal


Re: leaks in solr

2012-07-26 Thread roz dev
Hi Guys

I am also seeing this problem.

I am using SOLR 4 from Trunk and seeing this issue repeat every day.

Any inputs about how to resolve this would be great

-Saroj


On Thu, Jul 26, 2012 at 8:33 AM, Karthick Duraisamy Soundararaj 
karthick.soundara...@gmail.com wrote:

 Did you find any more clues? I have this problem in my machines as well..

 On Fri, Jun 29, 2012 at 6:04 AM, Bernd Fehling 
 bernd.fehl...@uni-bielefeld.de wrote:

  Hi list,
 
  while monitoring my solr 3.6.1 installation I recognized an increase of
  memory usage
  in OldGen JVM heap on my slave. I decided to force Full GC from jvisualvm
  and
  send optimize to the already optimized slave index. Normally this helps
  because
  I have monitored this issue over the past. But not this time. The Full GC
  didn't free any memory. So I decided to take a heap dump and see what
  MemoryAnalyzer
  is showing. The heap dump is about 23 GB in size.
 
  1.)
  Report Top consumers - Biggest Objects:
  Total: 12.3 GB
  org.apache.lucene.search.FieldCacheImpl : 8.1 GB
  class java.lang.ref.Finalizer   : 2.1 GB
  org.apache.solr.util.ConcurrentLRUCache : 1.5 GB
  org.apache.lucene.index.ReadOnlySegmentReader : 622.5 MB
  ...
 
  As you can see, Finalizer has already reached 2.1 GB!!!
 
  * java.util.concurrent.ConcurrentHashMap$Segment[16] @ 0x37b056fd0
* segments java.util.concurrent.ConcurrentHashMap @ 0x39b02d268
  * map org.apache.solr.util.ConcurrentLRUCache @ 0x398f33c30
* referent java.lang.ref.Finalizer @ 0x37affa810
  * next java.lang.ref.Finalizer @ 0x37affa838
  ...
 
  Seams to be org.apache.solr.util.ConcurrentLRUCache
  The attributes are:
 
  Type   |Name  | Value
  -
  boolean| isDestroyed  |  true
  -
  ref| cleanupThread|  null
  
  ref| evictionListener |  null
  ---
  long   | oldestEntry  | 0
  --
  int| acceptableWaterMark |  9500
 
 --
  ref| stats| org.apache.solr.util.ConcurrentLRUCache$Stats
  @ 0x37b074dc8
  
  boolean| islive   |  true
  -
  boolean| newThreadForCleanup | false
  
  boolean| isCleaning   | false
 
 
 
  ref| markAndSweepLock | java.util.concurrent.locks.ReentrantLock @
  0x39bf63978
  -
  int| lowerWaterMark   |  9000
  -
  int| upperWaterMark   | 1
  -
  ref|  map | java.util.concurrent.ConcurrentHashMap @
  0x39b02d268
  --
 
 
 
 
  2.)
  While searching for open files and their references I noticed that there
  are references to
  index files which are already deleted from disk.
  E.g. recent index files are data/index/_2iqw.frq and
  data/index/_2iqx.frq.
  But I also see references to data/index/_2hid.frq which are quite old
  and are deleted way back
  from earlier replications.
  I have to analyze this a bit deeper.
 
 
  So far my report, I go on analyzing this huge heap dump.
  If you need any other info or even the heap dump, let me know.
 
 
  Regards
  Bernd
 
 



Re: language detection and phonetic

2012-07-26 Thread Paul Libbrecht

On 26 Jul 2012, at 21:22, Alireza Salimi wrote:

 The question is: is there any cleaner way to do that?

I've always done phonetic matching using a separate phonetic field (title-ph for 
example) and a copyField.
There's one considerable advantage to that: using something like dismax, you can 
prefer exact matches but also honour phonetic matches (by boosting 
title-fr^2 title-ph^1.1).

Paul
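
For illustration, a small SolrJ sketch of that kind of query (the
title-fr/title-ph field names follow the example above; everything else is
assumed):

import org.apache.solr.client.solrj.SolrQuery;

public class PhoneticBoostExample {
    public static SolrQuery build(String userInput) {
        SolrQuery q = new SolrQuery(userInput);
        q.set("defType", "dismax");
        // Exact (analyzed) matches on title-fr outweigh phonetic matches on
        // title-ph, but a phonetic-only match still contributes to the score.
        q.set("qf", "title-fr^2 title-ph^1.1");
        return q;
    }
}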



Re: Bulk indexing data into solr

2012-07-26 Thread Mikhail Khludnev
IIRC a problem with such a scheme was discussed here about two months ago, but I
can't remember the exact details.
The scheme is generally correct. But you didn't say how you let Solr know
that it needs to reread the new index generation after the indexer fsyncs
segments.gen.

btw, it might be a possible issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents & metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

You should ensure that after segments.gen is fsync'ed, all the other index
files are fsynced for other processes too.

Could you tell us more about your data:
What's the format?
Is it located close to the indexer?
And why can't you use remote streaming via Solr's update handler, or an indexer
client app with StreamingUpdateSolrServer?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:

 Hi,

 I think at least before lucene 4.0 we can only allow one process/thread to
 write on
 a lucene folder. Based on this fact my initial plan is:

 1) There is one set of lucene index folders.
 2) Solr server only perform queries in those servers
 3) Having a separate process (multi-threads) to index those lucene folders
 (each
folder is a separate app). Only one thread will index one given lucene
 folder.

 Thanks very much for helps, Lisheng


 -Original Message-
 From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
 Sent: Thursday, July 26, 2012 10:15 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Bulk indexing data into solr


 Coming back to your original question. I'm puzzled a little.
 It's not clear where you wanna call Lucene API directly from.
 if you mean that you has standalone indexer, which write index files. Then
 it stops and these files become available for Solr Process it will work.
 Sharing index between processes, or using EmbeddedServer is looking for
 problem (despite Lucene has Locks mechanism, which I'm not completely aware
 of).
 I can conclude that your data for indexing is collocate with the solr
 server. In this case consider
 http://wiki.apache.org/solr/ContentStream#RemoteStreaming

 Please give more details about your design.

 On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
 lisheng.zh...@broadvision.com wrote:

 
  Hi,
 
  I am starting to use solr, now I need to index a rather large amount of
  data, it seems
  that calling solr to pass data through HTTP is rather inefficient, I am
  think still call
  lucene API directly for bulk index but to use solr for search, is this
  design OK?
 
  Thanks very much for helps, Lisheng
 
 


 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: separation of indexes to optimize facet queries without fulltext

2012-07-26 Thread Daniel Brügge
Hi Chris,

thanks for the answer.

the plan is that in lots of queries I just need faceted values and
don't even do a fulltext search.
On the other hand, I need the fulltext search for exactly one
task in my application, which is searching documents and returning them.
There no faceting at all is needed, only filtering on fields,
which I also use for the other queries.
So if 95% of the queries don't use the fulltext, I thought it would
make sense to split them.

Your suggestion to have one main master index and several slave indexes
sounds promising. Is it possible to have this replication in SolrCloud, e.g.
with different kinds of schemas etc.?

Thanks. Daniel

On Thu, Jul 26, 2012 at 9:05 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : My thought was, that I could separate indexes. So for the facet queries
 : where I don't need
 : fulltext search (so also no indexed fulltext field) I can use a
 completely
 : new setup of a
 : sharded Solr which doesn't include the indexed fulltext, so the index is
 : kept small containing
 : just the few fields I have.
 :
 : And for the fulltext queries I have the current Solr configuration which
 : includes as mentioned
 : above all the fields incl. the index fulltext field.
 :
 : Is this a normal way of handling these requirements. That there are
 : different kind of
 : Solr configurations for the different needs? Because the huge redundancy

 It's definitley doable -- one thing i'm not clear on is why, if your
 faceting queries don't care about the full text, you would need to leave
 those small fields in your full index ... is your plan to do
 faceting and drill down using the smaller index, but then display docs
 resulting from those queries by using the same fq params when querying
 the full index ?

 if so then it should work, if not -- you may not need those fields in that
 index.

 In general there is nothing wrong with having multiple indexes to solve
 multiple usecases -- an index is usually an inverted denormalization of
 some structured source data designed for fast queries/retrieval.  If there
 are multiple distinct ways you want to query/retrieve data that don't lend
 themselves to the same denormalization, there's nothing wrong with
 multiple denormalizations.

 Something else to consider is an approach i've used many times: having a
 single index, but using special purpose replicas.  You can have a master
 index that you update at the rate of change, one set of slaves that are
 used for one type of query pattern (faceting on X, Y, and Z for example)
 and a differnet set of slaves that are used for a different query pattern
 (faceting on A, B, and C) so each set of slaves gets a higher cahce hit
 rate then if the queries were randomized across all machines

 -Hoss



RE: Bulk indexing data into solr

2012-07-26 Thread Zhang, Lisheng
Hi,

I really appreciate your quick helps!

1) I want Solr not to cache any IndexReader (hopefully that is possible),
because our app is made of many Lucene folders and each of them is not very
large; from my previous tests it seems that performance is fine if we just
create an IndexReader each time. Hopefully this way we have no sync issue?

2) Our data is mainly in an RDB (currently in MySQL; it will move to Cassandra
later). My main concern is that by using Solr we need to pass a rather large
amount of data through the network layer via HTTP, which could be a problem?

Best regards, Lisheng

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Thursday, July 26, 2012 12:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Bulk indexing data into solr


IIRC about a two month ago problem with such scheme discussed here, but I
can remember exact details.
Scheme is generally correct. But you didn't tell how do you let solr know
that it need to reread new index generation, after indexer fsync segments
get.

btw, it might be a possible issue:
https://lucene.apache.org/core/old_versioned_docs//versions/3_0_1/api/all/org/apache/lucene/index/IndexWriter.html#commit()
 Note that this operation calls Directory.sync on the index files. That
call should not return until the file contents  metadata are on stable
storage. For FSDirectory, this calls the OS's fsync. But, beware: some
hardware devices may in fact cache writes even during fsync, and return
before the bits are actually on stable storage, to give the appearance of
faster performance.

you should ensure that after segments.get is fsync'ed, all other index
files are fsynced for other processes too.

Could you tell more about your data:
what's the format?
whether they are located relatively to indexer?
And why you can't use remote streaming by Solr's upd handler or indexer
client app with StreamingUpdateServer ?

On Thu, Jul 26, 2012 at 10:47 PM, Zhang, Lisheng 
lisheng.zh...@broadvision.com wrote:

 Hi,

 I think at least before lucene 4.0 we can only allow one process/thread to
 write on
 a lucene folder. Based on this fact my initial plan is:

 1) There is one set of lucene index folders.
 2) Solr server only perform queries in those servers
 3) Having a separate process (multi-threads) to index those lucene folders
 (each
folder is a separate app). Only one thread will index one given lucene
 folder.

 Thanks very much for helps, Lisheng


 -Original Message-
 From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
 Sent: Thursday, July 26, 2012 10:15 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Bulk indexing data into solr


 Coming back to your original question, I'm a little puzzled.
 It's not clear where you want to call the Lucene API from.
 If you mean that you have a standalone indexer which writes the index files,
 then stops, and those files become available to the Solr process, it will work.
 Sharing an index between processes, or using EmbeddedSolrServer, is looking for
 problems (even though Lucene has a lock mechanism, which I'm not completely
 familiar with).
 I conclude that your data for indexing is colocated with the Solr server. In
 this case consider
 http://wiki.apache.org/solr/ContentStream#RemoteStreaming

 Please give more details about your design.

 On Thu, Jul 26, 2012 at 1:22 PM, Zhang, Lisheng 
 lisheng.zh...@broadvision.com wrote:

 
  Hi,
 
  I am starting to use Solr, and now I need to index a rather large amount of
  data. It seems that calling Solr and passing the data through HTTP is rather
  inefficient, so I am thinking of still calling the Lucene API directly for
  bulk indexing but using Solr for search. Is this design OK?
 
  Thanks very much for helps, Lisheng
 
 


 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com




-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Map/Reduce directly against solr4 index.

2012-07-26 Thread Trung Pham
Is it possible to run map reduce jobs directly on Solr4?

I'm asking this because I want to use Solr4 as the primary storage engine.
And I want to be able to run near real time analytics against it as well.
Rather than export solr4 data out to a hadoop cluster.


Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Darren Govoni
Of course you can do it, but the question is whether this will produce
the performance results you expect.
I've seen talk about this in other forums, so you might find some prior
work here.

Solr and HDFS serve somewhat different purposes. The key issue would be
if your map and reduce code
overloads the Solr endpoint. Even using SolrCloud, I believe all
requests will have to go through a single
URL (to be routed), so if you have thousands of map/reduce jobs all
running simultaneously, the question is whether
your Solr is architected to handle that amount of throughput.


On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:

 Is it possible to run map reduce jobs directly on Solr4?
 
 I'm asking this because I want to use Solr4 as the primary storage engine.
 And I want to be able to run near real time analytics against it as well.
 Rather than export solr4 data out to a hadoop cluster.




UUID generation not working

2012-07-26 Thread gopes
Hi 

1.  I am using UUID to generate a unique id in my collection, but when I tried
to index the collection it could not find any documents.  Can you please
tell me how to use UUID in schema.xml?

Thanks,
Sarala





--
View this message in context: 
http://lucene.472066.n3.nabble.com/UUID-generation-not-working-tp3997571.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Schmidt Jeff
It's not free (for production use anyway), but you might consider DataStax 
Enterprise: http://www.datastax.com/products/enterprise

It is a very nice consolidation of Cassandra, Solr and Hadoop.  No ETL required.

Cheers,

Jeff

On Jul 26, 2012, at 3:55 PM, Trung Pham wrote:

 Is it possible to run map reduce jobs directly on Solr4?
 
 I'm asking this because I want to use Solr4 as the primary storage engine.
 And I want to be able to run near real time analytics against it as well.
 Rather than export solr4 data out to a hadoop cluster.



Re: leaks in solr

2012-07-26 Thread Mark Miller

On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:

 Hi Guys
 
 I am also seeing this problem.
 
 I am using SOLR 4 from Trunk and seeing this issue repeat every day.
 
 Any inputs about how to resolve this would be great
 
 -Saroj


Trunk from what date?

- Mark











Re: leaks in solr

2012-07-26 Thread roz dev
it was from 4/11/12

-Saroj

On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote:


 On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:

  Hi Guys
 
  I am also seeing this problem.
 
  I am using SOLR 4 from Trunk and seeing this issue repeat every day.
 
  Any inputs about how to resolve this would be great
 
  -Saroj


 Trunk from what date?

 - Mark












Re: UUID generation not working

2012-07-26 Thread Chris Hostetter
: 
: 1.  I am using UUID to generate unique id in my collection but when I tried
: to index the collection it could not find any doucmnets.  can you please
: tell me how to use UUID in schema.xm

In general, if you are having a problem achieving a goal, please post what
you've tried and what kinds of errors/behavior you are getting instead --
i.e., in this case, telling us *how* you have already tried using UUID to
generate a unique id would be helpful.

In Solr 3.x, you can use the UUIDField like so...

  <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
...
  <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
...
  <uniqueKey>id</uniqueKey>

...to generate a new UUID for every doc added.  But for Solr 4.x some 
things have changed, as noted in the Upgrading section for Solr 
4.0.0-ALPHA...


* Due to low level changes to support SolrCloud, the uniqueKey field can no 
  longer be populated via <copyField/> or <field default=...> in the 
  schema.xml.  Users wishing to have Solr automatically generate a uniqueKey 
  value when adding documents should instead use an instance of
  solr.UUIDUpdateProcessorFactory in their update processor chain.  See 
  SOLR-2796 for more details.
...
https://issues.apache.org/jira/browse/SOLR-2796
https://issues.apache.org/jira/browse/SOLR-3495
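
Once a chain like that is wired up in solrconfig.xml, clients just leave the
uniqueKey out of the documents they send.  A rough SolrJ sketch; the chain
name "uuid" and the field name are made up, and it assumes the chain contains
solr.UUIDUpdateProcessorFactory:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class AddWithoutId {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("title", "no id supplied here");  // uniqueKey omitted on purpose
        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("update.chain", "uuid");  // hypothetical chain name from solrconfig.xml
        req.process(server);
        server.commit();
      }
    }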





-Hoss


Re: solr host name on solrconfig.xml

2012-07-26 Thread Chris Hostetter

: i need the host name of my solr-server in my solrconfig.xml
: anybody knows the correct variable?
: 
: something like ${solr.host} or ${solr.host.name} ...
: 
: exists an documantation about ALL available variables in the solr
: namespaces?

Off the top of my head I don't know of any system properties that Solr creates
for you in the solr.* namespace -- when you see examples of people talking
about things like ${solr.data.dir}, that's just a convention in the example
files: it's a property you can set when you run Solr, and Solr will *read*
that value because you reference it in your solrconfig.xml.

Any run-time Java system property should be available when the solrconfig.xml
is read, and you can get a list of all the properties in your system from the
Properties link in the Solr Admin UI.  I don't think there is a standard Java
system property for the hostname (machines can have multiple hostnames, even
multiple IPs) but you could always do something like...

java -Dsolr.my.hostname=`hostname` -jar start.jar

...when running solr.

-Hoss


Re: leaks in solr

2012-07-26 Thread Mark Miller
I'd take a look at this issue: https://issues.apache.org/jira/browse/SOLR-3392

Fixed late April.

On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote:

 it was from 4/11/12
 
 -Saroj
 
 On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com wrote:
 
 
 On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:
 
 Hi Guys
 
 I am also seeing this problem.
 
 I am using SOLR 4 from Trunk and seeing this issue repeat every day.
 
 Any inputs about how to resolve this would be great
 
 -Saroj
 
 
 Trunk from what date?
 
 - Mark
 
 
 
 
 
 
 
 
 
 

- Mark Miller
lucidimagination.com













Re: leaks in solr

2012-07-26 Thread roz dev
Thanks Mark.

We are never calling commit or optimize with openSearcher=false.

As per logs, this is what is happening

openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}

--
But, We are going to use 4.0 Alpha and see if that helps.

-Saroj










On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com wrote:

 I'd take a look at this issue:
 https://issues.apache.org/jira/browse/SOLR-3392

 Fixed late April.

 On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote:

  it was from 4/11/12
 
  -Saroj
 
  On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:
 
  Hi Guys
 
  I am also seeing this problem.
 
  I am using SOLR 4 from Trunk and seeing this issue repeat every day.
 
  Any inputs about how to resolve this would be great
 
  -Saroj
 
 
  Trunk from what date?
 
  - Mark
 
 
 
 
 
 
 
 
 
 

 - Mark Miller
 lucidimagination.com














Re: Binary content index with multiple cores

2012-07-26 Thread Chris Hostetter

: Here is my solrconfig.xml for one of the core :
...
:   <lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />
:   <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
...
: I've added the maven dependencies like this for the solr war :
... 
:   <dependency>
:     <groupId>org.apache.solr</groupId>
:     <artifactId>solr-cell</artifactId>
:     <classpath>shared</classpath>
:   </dependency>


Doing both of these things is the precise cause of your problem.

You now have two instances of all of the solr-cell classes in your classpath,
at different levels of the hierarchy.  Due to the eccentricities of Java
classloading, this is causing the classloader to not realize that the
ExtractingRequestHandler class that it finds is in fact a subclass of the
SolrRequestHandler class that it finds.

If you want to modify the war, modify the war.
If you want to load jars as a plugin, load them as plugins.

Under no circumstances should you try to do both with the same jar(s).


-Hoss


Re: leaks in solr

2012-07-26 Thread Karthick Duraisamy Soundararaj
Mark,
We use solr 3.6.0 on freebsd 9. Over a period of time, it
accumulates lots of space!

On Thu, Jul 26, 2012 at 8:47 PM, roz dev rozde...@gmail.com wrote:

 Thanks Mark.

 We are never calling commit or optimize with openSearcher=false.

 As per logs, this is what is happening

 openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}

 --
 But, We are going to use 4.0 Alpha and see if that helps.

 -Saroj










 On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com
 wrote:

  I'd take a look at this issue:
  https://issues.apache.org/jira/browse/SOLR-3392
 
  Fixed late April.
 
  On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote:
 
   it was from 4/11/12
  
   -Saroj
  
   On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com
  wrote:
  
  
   On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:
  
   Hi Guys
  
   I am also seeing this problem.
  
   I am using SOLR 4 from Trunk and seeing this issue repeat every day.
  
   Any inputs about how to resolve this would be great
  
   -Saroj
  
  
   Trunk from what date?
  
   - Mark
  
  
  
  
  
  
  
  
  
  
 
  - Mark Miller
  lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 



Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Trung Pham
I think the performance should be close to Hadoop running on HDFS, if somehow
the Hadoop job can directly read the Solr index files while executing
the job on the local Solr node.

Kinda like how HBase and Cassandra integrate with Hadoop.

Plus, we can run the map reduce job on a standby Solr4 cluster.

This way, the documents in Solr will be our primary source of truth. And we
have the ability to run near real time search queries and analytics on it.
No need to export data around.

Solr4 is becoming a very interesting solution to many web scale problems.
Just missing the map/reduce component. :)

On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:

 Of course you can do it, but the question is whether this will produce
 the performance results you expect.
 I've seen talk about this in other forums, so you might find some prior
 work here.

 Solr and HDFS serve somewhat different purposes. The key issue would be
 if your map and reduce code
 overloads the Solr endpoint. Even using SolrCloud, I believe all
 requests will have to go through a single
 URL (to be routed), so if you have thousands of map/reduce jobs all
 running simultaneously, the question is whether
 your Solr is architected to handle that amount of throughput.


 On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:

  Is it possible to run map reduce jobs directly on Solr4?
 
  I'm asking this because I want to use Solr4 as the primary storage
 engine.
  And I want to be able to run near real time analytics against it as well.
  Rather than export solr4 data out to a hadoop cluster.





Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Darren Govoni
You raise an interesting possibility. A map/reduce solr handler over
solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:

 I think the performance should be close to Hadoop running on HDFS, if
 somehow Hadoop job can directly read the Solr Index file while executing
 the job on the local solr node.
 
 Kindna like how HBase and Cassadra integrate with Hadoop.
 
 Plus, we can run the map reduce job on a standby Solr4 cluster.
 
 This way, the documents in Solr will be our primary source of truth. And we
 have the ability to run near real time search queries and analytics on it.
 No need to export data around.
 
 Solr4 is becoming a very interesting solution to many web scale problems.
 Just missing the map/reduce component. :)
 
 On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:
 
  Of course you can do it, but the question is whether this will produce
  the performance results you expect.
  I've seen talk about this in other forums, so you might find some prior
  work here.
 
  Solr and HDFS serve somewhat different purposes. The key issue would be
  if your map and reduce code
  overloads the Solr endpoint. Even using SolrCloud, I believe all
  requests will have to go through a single
  URL (to be routed), so if you have thousands of map/reduce jobs all
  running simultaneously, the question is whether
  your Solr is architected to handle that amount of throughput.
 
 
  On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
 
   Is it possible to run map reduce jobs directly on Solr4?
  
   I'm asking this because I want to use Solr4 as the primary storage
  engine.
   And I want to be able to run near real time analytics against it as well.
   Rather than export solr4 data out to a hadoop cluster.
 
 
 




Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Lance Norskog
Mahout includes a file reader for Lucene indexes. It will read from
HDFS or local disks.
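
I won't try to reproduce Mahout's reader here, but reading a local index
directly is just plain Lucene; a rough 3.x-style sketch, with a made-up path
and field name:

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    public class DumpStoredFields {
      public static void main(String[] args) throws Exception {
        // hypothetical path to a Solr core's data/index directory
        IndexReader reader = IndexReader.open(
            FSDirectory.open(new File("/var/solr/core1/data/index")));
        try {
          for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;   // skip deleted docs (3.x API)
            Document doc = reader.document(i);   // stored fields only
            System.out.println(doc.get("id"));
          }
        } finally {
          reader.close();
        }
      }
    }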

On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com wrote:
 You raise an interesting possibility. A map/reduce solr handler over
 solrcloud...

 On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:

 I think the performance should be close to Hadoop running on HDFS, if
 somehow Hadoop job can directly read the Solr Index file while executing
 the job on the local solr node.

 Kindna like how HBase and Cassadra integrate with Hadoop.

 Plus, we can run the map reduce job on a standby Solr4 cluster.

 This way, the documents in Solr will be our primary source of truth. And we
 have the ability to run near real time search queries and analytics on it.
 No need to export data around.

 Solr4 is becoming a very interesting solution to many web scale problems.
 Just missing the map/reduce component. :)

 On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:

  Of course you can do it, but the question is whether this will produce
  the performance results you expect.
  I've seen talk about this in other forums, so you might find some prior
  work here.
 
  Solr and HDFS serve somewhat different purposes. The key issue would be
  if your map and reduce code
  overloads the Solr endpoint. Even using SolrCloud, I believe all
  requests will have to go through a single
  URL (to be routed), so if you have thousands of map/reduce jobs all
  running simultaneously, the question is whether
  your Solr is architected to handle that amount of throughput.
 
 
  On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
 
   Is it possible to run map reduce jobs directly on Solr4?
  
   I'm asking this because I want to use Solr4 as the primary storage
  engine.
   And I want to be able to run near real time analytics against it as well.
   Rather than export solr4 data out to a hadoop cluster.
 
 
 





-- 
Lance Norskog
goks...@gmail.com


Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Trung Pham
Can it read distributed lucene indexes in SolrCloud?
On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote:

 Mahout includes a file reader for Lucene indexes. It will read from
 HDFS or local disks.

 On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com
 wrote:
  You raise an interesting possibility. A map/reduce solr handler over
  solrcloud...
 
  On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
 
  I think the performance should be close to Hadoop running on HDFS, if
  somehow Hadoop job can directly read the Solr Index file while executing
  the job on the local solr node.
 
  Kindna like how HBase and Cassadra integrate with Hadoop.
 
  Plus, we can run the map reduce job on a standby Solr4 cluster.
 
  This way, the documents in Solr will be our primary source of truth.
 And we
  have the ability to run near real time search queries and analytics on
 it.
  No need to export data around.
 
  Solr4 is becoming a very interesting solution to many web scale
 problems.
  Just missing the map/reduce component. :)
 
  On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com
 wrote:
 
   Of course you can do it, but the question is whether this will produce
   the performance results you expect.
   I've seen talk about this in other forums, so you might find some
 prior
   work here.
  
   Solr and HDFS serve somewhat different purposes. The key issue would
 be
   if your map and reduce code
   overloads the Solr endpoint. Even using SolrCloud, I believe all
   requests will have to go through a single
   URL (to be routed), so if you have thousands of map/reduce jobs all
   running simultaneously, the question is whether
   your Solr is architected to handle that amount of throughput.
  
  
   On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
  
Is it possible to run map reduce jobs directly on Solr4?
   
I'm asking this because I want to use Solr4 as the primary storage
   engine.
And I want to be able to run near real time analytics against it as
 well.
Rather than export solr4 data out to a hadoop cluster.
  
  
  
 
 



 --
 Lance Norskog
 goks...@gmail.com



Re: leaks in solr

2012-07-26 Thread Lance Norskog
What does the Statistics page in the Solr admin say? There might be
several searchers open: org.apache.solr.search.SolrIndexSearcher

Each searcher holds open different generations of the index. If
obsolete index files are held open, it may be old searchers. How big
are the caches? How long does it take to autowarm them?

On Thu, Jul 26, 2012 at 6:15 PM, Karthick Duraisamy Soundararaj
karthick.soundara...@gmail.com wrote:
 Mark,
 We use solr 3.6.0 on freebsd 9. Over a period of time, it
 accumulates lots of space!

 On Thu, Jul 26, 2012 at 8:47 PM, roz dev rozde...@gmail.com wrote:

 Thanks Mark.

 We are never calling commit or optimize with openSearcher=false.

 As per logs, this is what is happening

 openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}

 --
 But, We are going to use 4.0 Alpha and see if that helps.

 -Saroj










 On Thu, Jul 26, 2012 at 5:12 PM, Mark Miller markrmil...@gmail.com
 wrote:

  I'd take a look at this issue:
  https://issues.apache.org/jira/browse/SOLR-3392
 
  Fixed late April.
 
  On Jul 26, 2012, at 7:41 PM, roz dev rozde...@gmail.com wrote:
 
   it was from 4/11/12
  
   -Saroj
  
   On Thu, Jul 26, 2012 at 4:21 PM, Mark Miller markrmil...@gmail.com
  wrote:
  
  
   On Jul 26, 2012, at 3:18 PM, roz dev rozde...@gmail.com wrote:
  
   Hi Guys
  
   I am also seeing this problem.
  
   I am using SOLR 4 from Trunk and seeing this issue repeat every day.
  
   Any inputs about how to resolve this would be great
  
   -Saroj
  
  
   Trunk from what date?
  
   - Mark
  
  
  
  
  
  
  
  
  
  
 
  - Mark Miller
  lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 
 




-- 
Lance Norskog
goks...@gmail.com


Re: Updating a SOLR index with a properties file

2012-07-26 Thread Lance Norskog
You can use the DataImportHandler. The DIH config would use a file
reader, then the line reader tool, then a regular expression to split
each line into two fields. If you need a unique ID, look up
the UUID tools.
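
If wiring that up in DIH feels like overkill, a plain SolrJ loader that walks
the properties file is only a few lines.  This is just a sketch (not DIH); the
core URL and field names are made up and would have to exist in your schema:

    import java.io.FileInputStream;
    import java.util.Properties;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PropertiesIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/bundles");
        String file = args[0];                       // path to the .properties file
        Properties props = new Properties();
        props.load(new FileInputStream(file));
        for (String key : props.stringPropertyNames()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", file + "#" + key);      // unique per bundle + key
          doc.addField("bundle_s", file);
          doc.addField("key_s", key);
          doc.addField("value_t", props.getProperty(key));
          solr.add(doc);
        }
        solr.commit();
      }
    }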

I have never heard of this use case.

On Thu, Jul 26, 2012 at 1:56 PM, Florian Popescu
florian.pope...@gmail.com wrote:
 I am not sure if this is already possible with the built in set of request
 handlers. I am trying to update the index using a properties file (one
 document per file). Is this something that can be done? I searched the wiki
 and none of the stuff there seems to be addressing this.

 Thanks in advance,
 Florian



-- 
Lance Norskog
goks...@gmail.com


Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Lance Norskog
No. This is just a Hadoop file input class. Distributed Hadoop has to
get files from a distributed file service. It sounds like you want
some kind of distributed file service that maps a TaskNode (??) on a
given server to the files available on that server. There might be
something that does this. HDFS works very hard at doing this; are you
sure it is not good enough? I am endlessly amazed at the speed of
these distributed apps.

Have you done a proof of concept?

On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham tr...@phamcom.com wrote:
 Can it read distributed lucene indexes in SolrCloud?
 On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote:

 Mahout includes a file reader for Lucene indexes. It will read from
 HDFS or local disks.

 On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com
 wrote:
  You raise an interesting possibility. A map/reduce solr handler over
  solrcloud...
 
  On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
 
  I think the performance should be close to Hadoop running on HDFS, if
  somehow Hadoop job can directly read the Solr Index file while executing
  the job on the local solr node.
 
  Kindna like how HBase and Cassadra integrate with Hadoop.
 
  Plus, we can run the map reduce job on a standby Solr4 cluster.
 
  This way, the documents in Solr will be our primary source of truth.
 And we
  have the ability to run near real time search queries and analytics on
 it.
  No need to export data around.
 
  Solr4 is becoming a very interesting solution to many web scale
 problems.
  Just missing the map/reduce component. :)
 
  On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com
 wrote:
 
   Of course you can do it, but the question is whether this will produce
   the performance results you expect.
   I've seen talk about this in other forums, so you might find some
 prior
   work here.
  
   Solr and HDFS serve somewhat different purposes. The key issue would
 be
   if your map and reduce code
   overloads the Solr endpoint. Even using SolrCloud, I believe all
   requests will have to go through a single
   URL (to be routed), so if you have thousands of map/reduce jobs all
   running simultaneously, the question is whether
   your Solr is architected to handle that amount of throughput.
  
  
   On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
  
Is it possible to run map reduce jobs directly on Solr4?
   
I'm asking this because I want to use Solr4 as the primary storage
   engine.
And I want to be able to run near real time analytics against it as
 well.
Rather than export solr4 data out to a hadoop cluster.
  
  
  
 
 



 --
 Lance Norskog
 goks...@gmail.com




-- 
Lance Norskog
goks...@gmail.com


Re: Updating a SOLR index with a properties file

2012-07-26 Thread Florian Popescu
Thanks! I will try it out and see how it works. 

This is for indexing a bunch of java resource bundles and trying to 'refactor' 
the keys. Basically trying to figure out if a key is used in multiple places 
and extracting it out if applicable. 

Florian 


On Jul 26, 2012, at 10:46 PM, Lance Norskog goks...@gmail.com wrote:

 You can use the DataImportHandler. The DIH file would use a file
 reader, then the line reader tool, then separate the line with a
 regular expression into two fields. If you need a unique ID, look up
 the UUID tools.
 
 I have never heard of this use case.
 
 On Thu, Jul 26, 2012 at 1:56 PM, Florian Popescu
 florian.pope...@gmail.com wrote:
 I am not sure if this is already possible with the built in set of request
 handlers. I am trying to update the index using a properties file (one
 document per file). Is this something that can be done? I searched the wiki
 and none of the stuff there seems to be addressing this.
 
 Thanks in advance,
 Florian
 
 
 
 -- 
 Lance Norskog
 goks...@gmail.com


Re: Significance of Analyzer Class attribute

2012-07-26 Thread Chris Hostetter

:  When I specify analyzer class in schema,  something
:  like below and do
:  analysis on this field in analysis page : I cant  see
:  verbose output on
:  tokenizer and filters

The reason for that is that if you use an explicit Analyzer
implementation, the analysis tool doesn't know what the individual phases
of the token filters are -- the Analyzer API doesn't expose that
information (some Analyzers may be monolithic and not made up of
individual TokenFilters).


 :  <fieldType name="text_chinese" class="solr.TextField">
 :    <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer">
 :      <tokenizer
...
 
: Above config is somehow wrong. You cannot use both analyzer combined 
: with tokenizer and filter altogether. If you want to use lucene analyzer 
: in schema.xml there should be only analyzer definition.

Right.  What's happening here is that since a class is specified for the
analyzer, it is ignoring the tokenizer + token filters listed.  I've opened a
bug to add better error checking to catch these kinds of configuration
mistakes...

https://issues.apache.org/jira/browse/SOLR-3683


-Hoss

Re: Significance of Analyzer Class attribute

2012-07-26 Thread Rajani Maski
Hi All,

  Thank you for the replies.



--Regards
Rajani


On Fri, Jul 27, 2012 at 9:58 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 :  When I specify analyzer class in schema,  something
 :  like below and do
 :  analysis on this field in analysis page : I cant  see
 :  verbose output on
 :  tokenizer and filters

 The reason for that is that if you use an explicit Analyzer
 implimentation, the analysis tool doesn't know what the individual phases
 of hte tokenfilters are -- the Analyzer API doesn't expose that
 information (some Analyzers may be monolithic and not made up of
 individual TokenFilters)


  :  fieldType name=text_chinese
 :  class=solr.TextField
 :analyzer
 :  class=org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
 :tokenizer
 ...

 : Above config is somehow wrong. You cannot use both analyzer combined
 : with tokenizer and filter altogether. If you want to use lucene analyzer
 : in schema.xml there should be only analyzer definition.

 Right.  what's happening here is htat since a class is specifid for hte
 analyzer, it is ignoring the tokenizer+tokenfilters listed.  I've opened a
 bug to add better error checking to catch these kinds of configuration
 mistakes...

 https://issues.apache.org/jira/browse/SOLR-3683


 -Hoss


Re: solr host name on solrconfig.xml

2012-07-26 Thread stockii
okay, thx. 

I know this way, but it's not so nice :P

I set a new variable in my core.properties file which I load in solr.xml for
each core =))



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-host-name-on-solrconfig-xml-tp3997371p3997652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Trung Pham
That is exactly what I want.

I want the distributed Hadoop TaskNode to be running on the same server
that is holding the local distributed solr index. This way there is no need
to move any data around... I think other people call this feature 'data
locality' of map/reduce.

I believe HBase and Hadoop integration work exactly like this. The only
difference here is we are substituting HDFS with the distributed Solr
indexes.

Since solr4 can manage the sharded/distributed index files, it's doing the
exact work that HDFS is doing. In theory, this should be achievable.

On Thu, Jul 26, 2012 at 7:51 PM, Lance Norskog goks...@gmail.com wrote:

 No. This is just a Hadoop file input class. Distributed Hadoop has to
 get files from a distributed file service. It sounds like you want
 some kind of distributed file service that maps a TaskNode (??) on a
 given server to the files available on that server. There might be
 something that does this. HDFS works very hard at doing this; are you
 sure it is not good enough? I am endlessly amazed at the speed of
 these distributed apps.

 Have you done a proof of concept?

 On Thu, Jul 26, 2012 at 7:40 PM, Trung Pham tr...@phamcom.com wrote:
  Can it read distributed lucene indexes in SolrCloud?
  On Jul 26, 2012 7:11 PM, Lance Norskog goks...@gmail.com wrote:
 
  Mahout includes a file reader for Lucene indexes. It will read from
  HDFS or local disks.
 
  On Thu, Jul 26, 2012 at 6:57 PM, Darren Govoni dar...@ontrenet.com
  wrote:
   You raise an interesting possibility. A map/reduce solr handler over
   solrcloud...
  
   On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
  
   I think the performance should be close to Hadoop running on HDFS, if
   somehow Hadoop job can directly read the Solr Index file while
 executing
   the job on the local solr node.
  
   Kindna like how HBase and Cassadra integrate with Hadoop.
  
   Plus, we can run the map reduce job on a standby Solr4 cluster.
  
   This way, the documents in Solr will be our primary source of truth.
  And we
   have the ability to run near real time search queries and analytics
 on
  it.
   No need to export data around.
  
   Solr4 is becoming a very interesting solution to many web scale
  problems.
   Just missing the map/reduce component. :)
  
   On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com
  wrote:
  
Of course you can do it, but the question is whether this will
 produce
the performance results you expect.
I've seen talk about this in other forums, so you might find some
  prior
work here.
   
Solr and HDFS serve somewhat different purposes. The key issue
 would
  be
if your map and reduce code
overloads the Solr endpoint. Even using SolrCloud, I believe all
requests will have to go through a single
URL (to be routed), so if you have thousands of map/reduce jobs all
running simultaneously, the question is whether
your Solr is architected to handle that amount of throughput.
   
   
On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
   
 Is it possible to run map reduce jobs directly on Solr4?

 I'm asking this because I want to use Solr4 as the primary
 storage
engine.
 And I want to be able to run near real time analytics against it
 as
  well.
 Rather than export solr4 data out to a hadoop cluster.
   
   
   
  
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Solr - hl.fragsize Issue

2012-07-26 Thread meghana
Hi @iorixxx, I use the DefaultSolrHighlighter, and yes, the fragment size also
includes the <em> tags; but even if we remove the <em> tags from the fragment,
the average fragment size is still 110 instead of 100.
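
For reference, this is roughly how I set it from SolrJ (URL and field name
simplified); as far as I understand, hl.fragsize is only a target size rather
than a hard cap, since fragments break on token boundaries, but a 10% average
overshoot still seems like a lot:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FragsizeCheck {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("text:solr");   // made-up query and field
        q.setHighlight(true);
        q.addHighlightField("text");
        q.setHighlightFragsize(100);                // target fragment size in chars
        q.setHighlightSimplePre("<em>");
        q.setHighlightSimplePost("</em>");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getHighlighting());
      }
    }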



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-hl-fragsize-Issue-tp3997457p3997656.html
Sent from the Solr - User mailing list archive at Nabble.com.