Re: Highlighting in large text fields

2007-06-25 Thread Burkamp, Christian
Hi Mike,

Thanks for the quick help. I just added a call to 
Highlighter.setMaxDocBytesToAnalyze() to my local copy of the 
HighlightingUtil.java and it worked all right. It would be great to have the 
limit for the docBytesToAnalyze configurable in solrconfig.xml. (But it's out 
of scope for me to implement this right now).
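
(For the archives, a minimal sketch of the call in question, assuming the
Lucene 2.x highlighter API of the time; the surrounding setup is illustrative,
not the actual HighlightingUtil code:)

    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // "query" is the parsed org.apache.lucene.search.Query being highlighted
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    // The default limit is 50 * 1024 bytes, which silently truncates long fields;
    // raise it so terms late in the stored text still get highlighted.
    highlighter.setMaxDocBytesToAnalyze(Integer.MAX_VALUE);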

--Christian

-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 25, 2007 19:34
To: solr-user@lucene.apache.org
Subject: Re: Highlighting in large text fields

On 25-Jun-07, at 4:59 AM, Burkamp, Christian wrote:

 Hi list,

 Highlighting does not work for words that are not located near the 
 beginning of a text field.
 In my index the whole text is stored in a text field for highlighting 
 purpose. This field is just stored but not indexed. The maxFieldLength 
 was set to 10.
 The document content can be retrieved from the index without any 
 problem but for some terms highlighting does not return anything. This 
 is the case for all words from position 9162 on.
 When I try to highlight the whole text (hl.fragsize=0) with some 
 common word as query, it returns the highlighted content, but just the 
 first 9162 words. The rest is omitted.

 Any idea what might be going wrong? 9162 seems not to be a standard 
 limit for IT systems.

The lucene highlighter by default only processes the first 50kB of 
text.  This is probably something that should be made configurable. 
I'll add it to the list of future features.

-Mike


Problem with surrogate characters in utf-8

2007-06-14 Thread Burkamp, Christian
Hi all,

I have a problem after updating to solr 1.2. I'm using the bundled jetty
that comes with the latest solr release.
Some of the contents that are stored in my index contain characters from
the unicode private section above 0x10. (They are used by some
proprietary software and the text extraction does not throw them out).
Contrasting to solr 1.1, the current release returns these characters
encoded as a sequence of two surrogate characters. Could this result from
some UTF-16 conversion that is taking place somewhere in the system? In
fact, a look into the index with Luke suggests that Lucene is storing
its data in UTF-16 encoding. The code point 0x100058 is stored as the
two surrogate characters 0xDBC0 and 0xDC58. This is the same behaviour
in solr 1.1 and 1.2. But while solr 1.1 puts the two characters together
to form one 4-byte UTF-8 character in the result, solr 1.2 returns the
UTF-8 codes for the two surrogate characters that I see using Luke.
Unfortunately this results in an invalid utf-8 encoded text that (for
example) can not be displayed by Internet Explorer.
A request like http://localhost:8983/solr/select?q=*:* results in an
error message from the browser.

This is easy to reproduce if someone would try to debug. I have attached
a valid utf-8 encoded xml document that contains the 4-byte encoded
codepoint 0x100058. It can be indexed with post.jar. Sending this
request via Internet Explorer now results in an error:
http://localhost:8983/solr/select?q=*:*

<< utf.xml >>
I tried the new solr 1.2 war file with the old example distribution
(solr 1.1 and jetty 5.1). Surprisingly enough this does not reveal the
problem. So the whole story might even be a jetty issue.

Any ideas?

-- Christian
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">UTF8TEST</field>
<field name="name">abcdefg􀁘hijklmnop</field>
</doc>
</add>
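
(As an illustration, a small self-contained Java program, independent of Solr,
showing how that code point relates to the surrogate pair visible in Luke and
to its correct 4-byte UTF-8 form:)

    import java.io.UnsupportedEncodingException;

    public class SurrogateDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            int codePoint = 0x100058;
            // In Java's internal UTF-16, a supplementary code point is a surrogate pair:
            char[] pair = Character.toChars(codePoint); // {0xDBC0, 0xDC58}, as seen in Luke
            // Serialized correctly, the pair becomes ONE 4-byte UTF-8 sequence:
            byte[] utf8 = new String(pair).getBytes("UTF-8"); // F4 80 81 98
            System.out.println(utf8.length); // prints 4
            // Encoding each surrogate char on its own (the Solr 1.2 symptom)
            // would instead yield two 3-byte sequences, which is invalid UTF-8.
        }
    }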


Re: SOLR Indexing/Querying

2007-05-31 Thread Burkamp, Christian
Hi there,

It looks a lot like using Solr's standard WordDelimiterFilter (see the sample 
schema.xml) does what you need.
It splits on alphabetical-to-numeric boundaries and on the various kinds of 
intra-word delimiters like "-", "_" or ".". You can decide whether the parts 
are put together again in addition to the split-up tokens. Control this with the 
parameters catenateWords, catenateNumbers and catenateAll.
Good documentation on this topic is found on the wiki:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
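
(For reference, a sketch of the relevant fieldtype, closely following the
sample schema.xml shipped with Solr; verify the exact attribute set against
your version:)

    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>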

-- Christian


-----Original Message-----
From: Frans Flippo [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 31, 2007 11:27
To: solr-user@lucene.apache.org
Subject: Re: SOLR Indexing/Querying


I think if you add a field that has an analyzer that creates tokens on 
alpha/digit/punctuation boundaries, that should go a long way. Use that both at 
index and search time.

For example:
* "3555LHP" becomes "3555 LHP"
  Searching for "D3555" becomes "D OR 3555", so it matches on token "3555" from 
"3555LHP".

* "t14240" becomes "t 14240"
  Searching for "t14240-ss" becomes "t OR 14240 OR ss", matching "14240" 
from "t14240".

Similarly for your other examples.

If this proves to be too broad, you may need to define some stricter rules, but 
you could use this for starters.

I think you will have to write your own analyzer, as it doesn't look like any 
of the analyzers available in Solr/Lucene do exactly what you need. But that's 
relatively straightforward. Just start with the code from one of the existing 
Analyzers (e.g. KeywordAnalyzer).
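
(A rough, untested sketch of such an analyzer against the Lucene 2.x analysis
API, splitting on letter/digit boundaries and lowercasing; the class name is
made up for illustration:)

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;

    // Emits maximal runs of letters or of digits as separate lowercased tokens,
    // so "t14240-ss" yields "t", "14240", "ss" at index and query time alike.
    public class AlphaNumSplitAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new Tokenizer(reader) {
                private int pos = 0;      // absolute offset of the next character
                private int pending = -1; // one-character pushback buffer

                public Token next() throws IOException {
                    int c;
                    do { c = read(); } // skip punctuation and whitespace
                    while (c != -1 && !Character.isLetterOrDigit((char) c));
                    if (c == -1) return null;
                    int start = pos - 1;
                    boolean digits = Character.isDigit((char) c);
                    StringBuffer sb = new StringBuffer();
                    while (c != -1 && Character.isLetterOrDigit((char) c)
                            && Character.isDigit((char) c) == digits) {
                        sb.append(Character.toLowerCase((char) c));
                        c = read();
                    }
                    if (c != -1) { pending = c; pos--; } // boundary starts next token
                    return new Token(sb.toString(), start, pos);
                }

                private int read() throws IOException {
                    if (pending != -1) { int r = pending; pending = -1; pos++; return r; }
                    int r = input.read();
                    if (r != -1) pos++;
                    return r;
                }
            };
        }
    }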

Good luck,
Frans

On 5/31/07, realw5 [EMAIL PROTECTED] wrote:


 Hey Guys,
 I need some guidance in regards to a problem we are having with our 
 solr index. Below is a list of terms our customers search for, which 
 are failing or not returning the complete set. The second column of the 
 list is the product id/keyword we want it to match.

 Can you give me some direction on how this can be done (or let me know if it 
 can't be done) with index/query analyzers. Any help is much appreciated!

 Dan

 ---

 Keyword Typed In / We want it to find

 D3555 / 3555LHP
 D460160-BN / D460160
 D460160BN / D460160
 Dd454557 / D454557
 84200ORB / 84200
 84200-ORB / 84200
 T13420-SCH / T13420
 t14240-ss / t14240
 --
 View this message in context: 
 http://www.nabble.com/SOLR-Indexing-Querying-tf3843221.html#a10883456
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Burkamp, Christian
Gereon,

The four bytes do not look like a valid UTF-8 encoded character. 4-byte 
characters in UTF-8 start with the binary sequence 11110 (for reference, 
see the excellent Wikipedia article on UTF-8 encoding).
Your problem looks like someone interpreted your valid 2-byte UTF-8 encoded 
character as two single-byte characters in some fancy encoding. This happens if 
you send XML updates to Solr via HTTP without setting the encoding properly. It 
is not sufficient to set the encoding in the XML declaration; you also need an 
additional HTTP header to set the encoding (Content-type: text/xml; 
charset=UTF-8).
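
(A sketch of what that looks like with curl, in the style of the update
examples elsewhere on this list; the filename is hypothetical:)

    curl http://localhost:8983/solr/update \
         -H 'Content-type: text/xml; charset=UTF-8' \
         --data-binary @docs.xml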

--Christian

-Ursprüngliche Nachricht-
Von: Gereon Steffens [mailto:[EMAIL PROTECTED] 
Gesendet: Mittwoch, 2. Mai 2007 09:59
An: solr-user@lucene.apache.org
Betreff: UTF-8 2-byte vs 4-byte encodings


Hi,

I have a question regarding UTF-8 encodings, illustrated by the 
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters, for 
example the e-acute character, represented as the two bytes 0xC3 0xA9. When this 
file is added to Solr and retrieved later, the XML output contains a four-byte 
representation of that character, namely 0xC3 0x83 0xC2 0xA9.

If, on the other hand, the input data contains this same character as an entity 
(&#xE9;), the output contains the two-byte encoded representation 0xC3 0xA9.

Why is that so, and is there a way to always get characters like these out of 
Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in my 
input files that contain raw (two-byte) UTF8 characters that can't be encoded 
as entities.

Thanks,
Gereon



Re: Help with Setup

2007-04-27 Thread Burkamp, Christian
Hi,

You can use curl with a file if you put the @ char in front of its name. 
(Otherwise curl expects the data on the command line.)

curl http://localhost:8080/solr/update --data-binary @articles.xml

-Ursprüngliche Nachricht-
Von: Sean Bowman [mailto:[EMAIL PROTECTED] 
Gesendet: Donnerstag, 26. April 2007 23:32
An: solr-user@lucene.apache.org
Betreff: Re: Help with Setup

Try:

curl http://localhost:8080/solr/update --data-binary '<add><doc><field 
name="id">2008</field><field name="storyText">The Rain in Spain Falls Mainly In 
The Plain</field></doc></add>'

And see if that works.  I don't think curl lets you put a filename in for the 
--data-binary parameter.  Has to be the actual data, though something like this 
might also work:

curl http://localhost:8080/solr/update --data-binary `cat articles.xml`

Those are backticks, not apostrophes.

On 4/26/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 
  paladin:/data/solr mtorgler1$ curl http://localhost:8080/solr/update 
  --data-binary articles.xml
  <result status="1">org.xmlpull.v1.XmlPullParserException: only whitespace 
  content allowed before start tag and not a (position:
  START_DOCUMENT seen a... @1:1)
  at org.xmlpull.mxp1.MXParser.parseProlog(MXParser.java:1519)
  at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1395)

 My guess is you have some funny character at the start of the document.
   I have seen funny chars show up when I edit a UTF-8 file and 
 save it as ASCII.  If you don't see it in your normal editor, try a 
 different one.

 If that does not help, start with the working example and modify it a 
 little bit at a time...

 ryan




Avoiding caching of special filter queries

2007-04-20 Thread Burkamp, Christian
Hi,

I'm using filter queries to implement document level security with solr.
The caching mechanism for filters separate from queries comes in handy
and the system performs well once all the filters for the users of the
system are stored in the cache.
However, I'm storing full document content in the index for the purpose
of highlighting. In addition to the standard snippet highlighting I
would like to offer a feature that displays the highlighted full
document content. I can add a filter query to select just the needed
document by ID, but this filter would go into the filter cache as well,
possibly throwing out some of the other useful filters.
Is there a way to get the single document with highlighting info but
without polluting the filter cache?

-- Christian



Re: Avoiding caching of special filter queries

2007-04-20 Thread Burkamp, Christian
Hi Erik,

No, what I need to do is 

q=my funny query&fq=user:erik&fq=id:<doc id>&hl=on ...

This is because the StandardRequestHandler needs the original query to do 
proper highlighting.
The user gets his paginated result page with his next 10 hits. He can then 
select one document for highlighting. Then I just repeat the last request with 
an additional filter query to select this one document and add the highlighting 
parameters.

-- Christian

-Ursprüngliche Nachricht-
Von: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Gesendet: Freitag, 20. April 2007 15:43
An: solr-user@lucene.apache.org
Betreff: Re: Avoiding caching of special filter queries



On Apr 20, 2007, at 7:11 AM, Burkamp, Christian wrote:
 I'm using filter queries to implement document level security with
 solr.
 The caching mechanism for filters separate from queries comes in handy
 and the system performs well once all the filters for the users of the
 system are stored in the cache.
 However, I'm storing full document content in the index for the  
 purpose
 of highlighting. In addition to the standard snippet highlighting I
 would like to offer a feature that displays the highlighted full
 document content. I can add a filter query to select just the needed
 document by ID, but this filter would go into the filter cache as well,
 possibly throwing out some of the other useful filters.
 Is there a way to get the single document with highlighting info but
 without polluting the filter cache?

Correct me if I'm wrong, but here's my understanding...

q=id:<doc id>&fq=user:erik

is what you'd want to do.  q=id:<doc id> won't go into the filter cache,  
but rather the query cache, and the document itself goes into the document  
cache.  So you won't risk bumping things out of the filter cache by  
using queries.

Erik



Re: solr performance

2007-02-20 Thread Burkamp, Christian
I do agree. There's probably no need to go to the index directly.
My current solr test server has more than 5M documents and a size of about 60GB.
I still index at 13 docs per second, and this still includes filtering of the 
documents.
(If you have your content ready in XML format, performance will be even better.)
It seems to me that indexing performance does not drop as the index grows.
Optimizing the index, however, does take a huge amount of time for large indexes.

--Christian

-Ursprüngliche Nachricht-
Von: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Gesendet: Dienstag, 20. Februar 2007 11:43
An: solr-user@lucene.apache.org
Betreff: Re: solr performance


You could build your index using Lucene directly and then point a  
Solr instance at it once it's built.  My suspicion is that the  
overhead of forming a document as an XML string and posting it to Solr  
via HTTP won't be that much different from indexing with Lucene  
directly.

My largest Solr index is currently at 1.4M documents, and it takes a max of 3ms  
to add a document (according to Solr's console), most of them 1ms.   
My single-threaded indexer is indexing around 1000 documents per  
minute, but I think I can get this number even faster by  
parallelizing the indexer.

I'm curious what rates others are indexing at ???

Erik



On Feb 20, 2007, at 2:21 AM, Jack L wrote:

 Hello,

 I have a question about solr's performance in accepting inserts and 
 indexing. If I have 10 million documents that I'd like to index, I 
 suppose it will take some time to submit them to solr. Is there any 
 faster way to do this than through the web interface?

 --
 Best regards,
 Jack




Re: highlight search keywords on html page

2007-02-19 Thread Burkamp, Christian
I was thinking about the same thing. It shouldn't be too difficult to subclass 
SolrRequestHandler and build a special HighlightingRequestHandler that uses the 
built-in highlighting utils to do the job. I wonder if it's possible to get 
access to the HTTP request body inside a SolrRequestHandler subclass. (The raw 
text to be highlighted would have to be passed to Solr as the body of an HTTP 
request.)
Storing the raw text in the solr index is a reasonable solution for small 
indexes only.

--Christian


-Ursprüngliche Nachricht-
Von: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Gesendet: Montag, 19. Februar 2007 03:00
An: solr-user@lucene.apache.org
Betreff: Re: highlight search keywords on html page



: When a user performs a search, I will return a list of links containing
: highlighted fragments
: from pageContent. If a link is clicked, I want to return the associated
: raw html back
: to user AND have search keywords in it to be highlighted, just like google
: cached page.

i'm not really sure that Solr can help you in this case ... it only knows about 
the data you give it -- if you want it to highlight the raw html of the entire 
page, then you're going to need to store the raw html of the entire page in the 
index.

you can still highlight pageContent with heavy fragmentation on your main 
search page where you list multiple results, and then when a user picks one, 
redo the search with an fq restricting to the doc they picked and hl.fl=rawHtml 
and hl.fragsize=0 so you'll get the whole thing highlighted without fragmentation.
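
(Spelled out as a request, with a hypothetical field name and doc id:)

http://localhost:8983/solr/select?q=the+original+query&fq=id:THEDOC&hl=on&hl.fl=rawHtml&hl.fragsize=0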

-Hoss



Re: performance testing practices

2007-02-05 Thread Burkamp, Christian
Hi there,

I am working on some performance numbers too. This is part of my evaluation of 
solr. I'm planning to replace a legacy search engine and have to find out if 
this is possible with solr.
I have loaded 1.1 million documents into solr by now. Indexing speed is not a 
big concern for me. I got about 17 documents per second while my indexing 
client is still only a Python prototype with a very slow filtering engine based 
on Windows IFilter.

I'm measuring search performance with a python client that continually queries 
solr. It grabs a random word from the results and uses it for the next search. 
For every search request, the time from sending the request until receiving the 
response is measured. Every query uses one word as search text and one word as 
filter query text. Highlighting is on.

Some first results:

Solr loaded with 112 documents:
Max queries per second: 14.5
Average request duration with only 1 client: 0.08 s
My criterion of 90% of requests completing in less than 1 second is met with a 
maximum of 10 parallel clients.
I expect to be able to serve at least 300 users with one system like this.
(Measured on a single-CPU Pentium 4 3GHz, 2GB RAM, internal standard ATA drive)

Next step will be to increase the number of documents till I reach the point 
where no request is completed in less than 1 second. (From this point on no 
amount of replication can bring me back to production performance).

I have a few questions, too.
- What size is the largest known solr server?
- What number of documents do you think can be handled by solr?
- Solr is using only one lucene index. There has been a thread about this 
before, but it was more related to bringing together different lucene indexes 
under one solr server. I potentially need a solution for up to 500 million 
documents. I believe this will not work without splitting the index. What do 
you think?
- Does anybody have performance numbers of their own they would share?
- Solr was running under jetty for my performance tests. What container is best 
suited for high performance?


Thanks a lot for the inspiring talk going on on this mailing list.

Christian


-Ursprüngliche Nachricht-
Von: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Gesendet: Montag, 5. Februar 2007 11:23
An: solr-user@lucene.apache.org
Betreff: performance testing practices


This week I'm going to be incrementally loading up to 3.7M records  
into Solr, in 50k chunks.

I'd like to capture some performance numbers after each chunk to see  
how she holds up.

What numbers are folks capturing?  What techniques are you using to  
capture numbers?  I'm not looking for anything elaborate, as the goal  
is really to see how faceting fares as more data is loaded.  We've  
got some ugly data in our initial experiment, so the faceting  
concerns me.

Thanks,
Erik