Re: large scale indexing issues / single threaded bottleneck

2011-10-29 Thread Michael McCandless
On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:

 one more thing, after somebody (thanks robert) pointed me at the
 stacktrace it seems kind of obvious what the root cause of your
 problem is. It's Solr :) Solr closes the IndexWriter on commit, which is
 very wasteful since you basically wait until all merges are done. Solr
 trunk has solved this problem.

That is very wasteful but I don't think it's actually the cause of the
slowdown here?

The cause looks like it's in applying deletes, which even once Solr
stops closing the IW will still occur (ie, IW.commit must also resolve
all deletes).

When IW resolves deletes it 1) opens a SegmentReader for each segment
in the index, and 2) looks up each deleted term and marks its
document(s) as deleted.
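The two steps above can be sketched in simplified form; this is an illustrative model in Python, not Lucene's actual code, and the names (`apply_deletes`, postings-as-dicts) are made up for the example:

```python
# Simplified model of how IndexWriter resolves buffered deletes:
# for each segment, open a (term -> doc ids) view, then look up every
# deleted term and mark its matching documents as deleted.

def apply_deletes(segments, deleted_terms):
    """segments: list of dicts mapping term -> list of doc ids (one per segment).
    Returns a parallel list of sets of deleted doc ids."""
    deletions = []
    for postings in segments:           # 1) one "SegmentReader" per segment
        deleted = set()
        for term in deleted_terms:      # 2) look up each deleted term
            for doc_id in postings.get(term, []):
                deleted.add(doc_id)     #    mark its document(s) as deleted
        deletions.append(deleted)
    return deletions

segments = [
    {"id:1": [0], "id:2": [1]},   # segment 0
    {"id:2": [0], "id:3": [1]},   # segment 1
]
print(apply_deletes(segments, ["id:2"]))  # -> [{1}, {0}]
```

The cost in the real thing is dominated by step 1 (opening the SegmentReader and its terms index) and the per-term lookups of step 2, which is why both knobs below matter.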

I saw a mention somewhere that you can tell Solr to use
IW.addDocument (not IW.updateDocument) when you add a document if you
are certain it's not replacing a previous document with the same ID --
I don't know how to do that, but if that's true, and you are truly only
adding documents, that could be the easiest fix here.

Failing that... you could try increasing
IndexWriterConfig.setReaderTermsIndexDivisor (not sure if/how this is
exposed in Solr's config)... this will make init time faster and RAM
usage lower for each SegmentReader, but term lookups slower; whether
this helps depends on whether your slowness is in opening the
SegmentReader (how long does it take to IR.open your index?) or in
resolving the deletes once the SR is open.

Do you have a great many terms in your index?  Can you run CheckIndex
and post the output?  (If so this might mean you have an analysis
problem, ie, putting too many terms in the index).

 We should maybe try to fix this in 3.x too?

+1; having to wait for running merges to complete when the app calls
commit is crazy (Lucene long ago removed that limitation).

Mike McCandless

http://blog.mikemccandless.com


Re: URL Redirect

2011-10-29 Thread Erik Hatcher
I would personally implement this in the app tier, above Solr.  

One way to do it using Solr to match keywords to URLs is to index special 
redirect documents with the keywords in the search field (either in the main 
index, or in a separate core index).  

But there is nothing magical built into Solr, at the moment, to do what 
you're asking out of the box. 

I'm curious what other tasks are tedious about migrating from Endeca to Solr.  

Erik



On Oct 28, 2011, at 23:40, prr wrote:

 Finotti Simone tech178 at yoox.com writes:
 
 
 Hello,
 
 I have been assigned the task to migrate from Endeca to Solr.
 
 The former engine allowed me to set keyword triggers that, when matched
 exactly, caused the web client to
 redirect to a specified URL.
 
 Does that feature exist in Solr? If so, where can I get some info?
 
 Thank you
 
 
 
 Hi, I am also looking at migrating from Endeca to Solr, but at first
 glance it looks extremely tedious to me. Please pass on any tips or
 suggestions on how to approach the problem.
 
 
 



Incomplete date expressions

2011-10-29 Thread Erik Fäßler
Hi all,

I want to index MEDLINE documents which do not always contain complete dates 
of publication. The year is always known. Now the Solr documentation states 
that dates must have the format 1995-12-31T23:59:59Z, for which month, day and 
even the time of day must be known.
I could, of course, just complement incomplete dates with default values, 01-01 
for example. But then I won't be able to distinguish between complete and 
incomplete dates afterwards, which is of importance when displaying the 
documents.

I could just store the known information, e.g. the year, in an integer-typed 
field, but then I won't have date math.

Is there a good solution to my problem? Probably I'm just missing the obvious, 
perhaps you can help me :-)

Best regards,

Erik

Re: Incomplete date expressions

2011-10-29 Thread François Schiettecatte
Erik

I would complement the date with default values as you suggest, and store a 
boolean flag indicating whether the date was complete or not. Better still, 
store the original date when it is incomplete: the presence of that data would 
tell you that the original date was not complete, and you would also retain 
the original value.
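A minimal sketch of this approach in Python; the field names (`pub_date`, `pub_date_precision`) are made-up examples, not a Solr convention:

```python
# Pad an incomplete date (year, or year-month) out to Solr's full
# 1995-12-31T23:59:59Z format, and record how precise the original was,
# so complete and incomplete dates stay distinguishable at display time.

def to_solr_date(raw):
    parts = raw.split("-")
    precision = ("year", "month", "day")[len(parts) - 1]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else "01"   # default missing month
    day = parts[2] if len(parts) > 2 else "01"     # default missing day
    return {
        "pub_date": f"{year}-{month}-{day}T00:00:00Z",
        "pub_date_precision": precision,           # or store `raw` itself
    }

print(to_solr_date("1995"))
# {'pub_date': '1995-01-01T00:00:00Z', 'pub_date_precision': 'year'}
```

With this you keep full date math on `pub_date` while the precision field (or the stored original string) tells the display layer what was actually known.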

Cheers

François

On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote:

 Hi all,
 
 I want to index MEDLINE documents which do not always contain complete dates 
 of publication. The year is always known. Now the Solr documentation states 
 that dates must have the format 1995-12-31T23:59:59Z, for which month, day 
 and even the time of day must be known.
 I could, of course, just complement incomplete dates with default values, 
 01-01 for example. But then I won't be able to distinguish between complete 
 and incomplete dates afterwards, which is of importance when displaying the 
 documents.
 
 I could just store the known information, e.g. the year, into an 
 integer-typed field, but then I won't have date math.
 
 Is there a good solution to my problem? Probably I'm just missing the 
 obvious, perhaps you can help me :-)
 
 Best regards,
 
   Erik



Re: Incomplete date expressions

2011-10-29 Thread Erik Fäßler
Hello François,

thank you for your quick reply. I thought about just storing which information 
I am lacking, and this would be a possibility of course. It just seemed a bit 
quick-and-dirty to me, and I wondered whether Solr really cannot understand 
dates which consist only of the year. Isn't it a common case that a date/time 
expression is not specified down to the hour, for example? But if there is no 
other possibility I will stick with your suggestion, thank you!

Best,

Erik

On 29.10.2011 at 15:20, François Schiettecatte wrote:

 Erik
 
 I would complement the date with default values as you suggest, and store a 
 boolean flag indicating whether the date was complete or not. Better still, 
 store the original date when it is incomplete: the presence of that data 
 would tell you that the original date was not complete, and you would also 
 retain the original value.
 
 Cheers
 
 François
 
 On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote:
 
 Hi all,
 
 I want to index MEDLINE documents which do not always contain complete dates 
 of publication. The year is always known. Now the Solr documentation states 
 that dates must have the format 1995-12-31T23:59:59Z, for which month, day 
 and even the time of day must be known.
 I could, of course, just complement incomplete dates with default values, 
 01-01 for example. But then I won't be able to distinguish between complete 
 and incomplete dates afterwards, which is of importance when displaying the 
 documents.
 
 I could just store the known information, e.g. the year, into an 
 integer-typed field, but then I won't have date math.
 
 Is there a good solution to my problem? Probably I'm just missing the 
 obvious, perhaps you can help me :-)
 
 Best regards,
 
  Erik
 



Re: large scale indexing issues / single threaded bottleneck

2011-10-29 Thread Yonik Seeley
On Sat, Oct 29, 2011 at 6:35 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 I saw a mention somewhere that you can tell Solr to use
 IW.addDocument (not IW.updateDocument) when you add a document if you
 are certain it's not replacing a previous document with the same ID

Right - adding overwrite=false to the URL when updating should do this.
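For example, a pure-add indexing client might build its update URL like this; the host, port, and core path below are assumptions, not part of Yonik's answer:

```python
from urllib.parse import urlencode

# Append overwrite=false to Solr's update URL so adds go through
# IW.addDocument rather than IW.updateDocument. This is only safe when
# you are certain no existing document shares the same unique ID.
base = "http://localhost:8983/solr/update"        # assumed host/core path
params = {"overwrite": "false", "commit": "false"}
url = base + "?" + urlencode(params)
print(url)
# http://localhost:8983/solr/update?overwrite=false&commit=false
```

POSTing documents to a URL built this way skips the per-ID delete lookup entirely, which is exactly the cost Mike describes above.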

-Yonik
http://www.lucidimagination.com


Re: large scale indexing issues / single threaded bottleneck

2011-10-29 Thread Nagendra Nagarajayya

Roman:


2) what would be the best way to port these (and only these) changes

to 3.4.0? I tried to dig into the branching and revisions, but got
lost quickly. Tried something like svn diff
[…]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
sure if it's even possible to merge these into 3.4.0


3) what would you recommend for production 24/7 use? 3.4.0?



If you want to try real-time indexing without commits with version 3.4.0, you 
can give Solr with RankingAlgorithm a try. It does not need commits to add 
documents (you can set your commits to every 15 minutes, or as desired).


You can get more information about NRT with 3.4.0 from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org


On 10/28/2011 11:38 AM, Roman Alekseenkov wrote:

Hi everyone,

I'm looking for some help with Solr indexing issues on a large scale.

We are indexing a few terabytes/month on a sizeable Solr cluster (8
masters / serving writes, 16 slaves / serving reads). After a certain
amount of tuning we got to the point where a single Solr instance can
handle an index size of 100GB without much trouble, but after that we
are starting to observe noticeable delays on index flush, and they are
getting larger. See the attached picture for details; it's done for a
single JVM on a single machine.

We are posting data in 8 threads using the javabin format and doing a
commit every 5K documents, with merge factor 20 and a RAM buffer size
of about 384MB. From the picture it can be seen that the
single-threaded index flushing code kicks in on every commit and blocks
all other indexing threads. The hardware is decent (12 physical / 24
virtual cores per machine) and it is mostly idle when the index is
flushing. Very little CPU utilization and disk I/O (5%), with the
exception of a single CPU core which actually does the index flush
(95% CPU, 5% I/O wait).
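For reference, the merge factor and RAM buffer settings described above would typically live in solrconfig.xml (Solr 3.x style); this fragment merely restates the values from this post, it is not a tuning recommendation:

```xml
<!-- Index-time settings as described above (Solr 3.x solrconfig.xml) -->
<indexDefaults>
  <!-- merge segments less often, at the cost of more segments on disk -->
  <mergeFactor>20</mergeFactor>
  <!-- flush the in-memory indexing buffer at ~384MB -->
  <ramBufferSizeMB>384</ramBufferSizeMB>
</indexDefaults>
```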

My questions are:

1) will Solr changes from real-time branch help to resolve these
issues? I was reading
http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
and it looks like we have exactly the same problem

2) what would be the best way to port these (and only these) changes
to 3.4.0? I tried to dig into the branching and revisions, but got
lost quickly. Tried something like svn diff
[…]realtime_search@r953476 […]realtime_search@r1097767, but I'm not
sure if it's even possible to merge these into 3.4.0

3) what would you recommend for production 24/7 use? 3.4.0?

4) is there a workaround that can be used? also, I listed the stack trace below

Thank you!
Roman

P.S. This single index flushing thread spends 99% of its time in
org.apache.lucene.index.BufferedDeletesStream.applyDeletes, and then
the merge seems to go quickly. I looked it up and it looks like the
intent here is deleting old commit points (we are keeping only 1
non-optimized commit point per config). Not sure why it is taking that
long.

pool-2-thread-1 [RUNNABLE] CPU time: 3:31
java.nio.Bits.copyToByteArray(long, Object, long, long)
java.nio.DirectByteBuffer.get(byte[], int, int)
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.TermInfosReader.init(Directory, String,
FieldInfos, int, int)
org.apache.lucene.index.SegmentCoreReaders.init(SegmentReader,
Directory, SegmentInfo, int, int)
org.apache.lucene.index.SegmentReader.get(boolean, Directory,
SegmentInfo, int, boolean, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
boolean, int, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
List)
org.apache.lucene.index.IndexWriter.doFlush(boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask$Sync.innerRun()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()




difference between analysis output and searches

2011-10-29 Thread Robert Petersen
Why is it that I can see an obvious match between terms in the
analysis admin page, yet sometimes they don't come back in searches?
Debug output on the searches indicates a non-match, yet the analysis
page shows an obvious match.  I don't get it.



Re: difference between analysis output and searches

2011-10-29 Thread Erik Hatcher
Robert -

Can you give us a concrete input text, the field type definition, and the 
query(/ies) that you'd expect to match?  The devil is in the details.

A match in analysis.jsp _only_ means that an index-time and query-time output 
token for the given text were equal.  But in the real world of doing a search, 
the query parser adds a whole other level of processing.  analysis.jsp does not 
do query parsing and thus can be misleading.

Erik


On Oct 29, 2011, at 13:45, Robert Petersen wrote:

 Why is it that I can see an obvious match between terms in the
 analysis admin page, yet sometimes they don't come back in searches?
 Debug output on the searches indicates a non-match, yet the analysis
 page shows an obvious match.  I don't get it.
 



shingles and dismax?

2011-10-29 Thread Vijay Ramachandran
Hello. While trying to understand why phrase match and boost were not
working with shingles and the dismax parser, I saw this thread -
http://lucene.472066.n3.nabble.com/Local-Params-syntax-not-protecting-Shingles-in-DisMax-from-Lucene-query-parser-td1563090.html

It states: "I really like the DisMax query parser, but of course its main
design is a bit at odds with shingles and phrases."

What are these issues? I thought that I'd use shingles in conjunction with a
higher pf boost and dismax to get better phrase matches, but it's just not
working - the shingle field is almost never seen to match, and I have no
idea why!

thanks,
vijay

-- 
Performance marketing on Twitter - http://www.wisdomtap.com/