RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Renaud Waldura
There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable, I'm afraid.

Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A. Grossman
Illinois Institute of Technology

It lists a number of other papers that are easy to find online. Let me know
what you find, I'm interested in this too.

--Renaud

 

-Original Message-
From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov] 
Sent: Thursday, February 26, 2009 8:29 AM
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Subject: Use of scanned documents for text extraction and indexing


Hi All:

Is there any study or research on using scanned paper documents as images
(perhaps as PDF), extracting their text with OCR or another technique, and
the quality of the resulting index?


Thanks in advance,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu






RE: Performance "dead-zone" due to garbage collection

2009-01-28 Thread Renaud Waldura
I'm coming in late on this thread, but I want to recommend the YourKit
Profiler product. It helped me track a performance problem similar to what
you describe. I had been futzing with GC logging etc. for days before
YourKit pinpointed the issue within minutes.

http://www.yourkit.com/

(My problem turned out to be silly. Straight Lucene, not Solr; the index was
opened and closed on every request. It worked OK for a few hours, then a
giant full GC kicked in, which froze the VM for minutes. Doh!)

Anyway, it may help you identify how much memory is used per request, etc.
and tune GC accordingly. Good luck!
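For reference, the kind of full-GC freeze described above shows up clearly in the GC log. A sketch of the logging flags for the Sun HotSpot JDKs of that era (start.jar is Solr's example Jetty launcher; paths and heap sizes are illustrative, not a recommendation):

```shell
# Log every collection with timestamps; after a full GC, the "used"
# figure in gc.log approximates the working set, per Walter's hint.
java -Xmx1024m \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log \
     -XX:+UseConcMarkSweepGC \
     -jar start.jar
```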

--Renaud


-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com] 
Sent: Friday, January 23, 2009 8:13 AM
To: solr-user@lucene.apache.org
Subject: RE: Performance "dead-zone" due to garbage collection

Can you share your experience with the IBM JDK once you've evaluated it?
You are working with a heavy load, I think many would benefit from the
feedback.

-Todd Feak

-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com]
Sent: Thursday, January 22, 2009 3:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Performance "dead-zone" due to garbage collection


I'm not sure if you suggested it, but I'd like to try the IBM JVM. Aside
from setting my JRE paths, is there anything else I need to do to run inside
the IBM JVM? (e.g. re-compiling?)


Walter Underwood wrote:
> 
> What JVM and garbage collector settings? We are using the IBM JVM with
> their concurrent generational collector. I would strongly recommend
> trying a similar collector on your JVM. Hint: how much memory is in
> use after a full GC? That is a good approximation to the working set.
> 
> 

-- 
View this message in context:
http://www.nabble.com/Performance-%22dead-zone%22-due-to-garbage-collection-tp21588427p21616078.html
Sent from the Solr - User mailing list archive at Nabble.com.






RE: Accented search

2008-03-11 Thread Renaud Waldura
Peter:

Very interesting. To take care of the issue you mention, could you add
multiple "synonyms" with progressively fewer accents? 

E.g. you'd index "préférence" as 4 tokens:
 préférence (unchanged)
 preférence (stripped one accent)
 préference (stripped the other accent)
 preference (stripped both accents)

Or does it yield too many tokens to be useful?

And how does this take care of scoring? Do you get a higher score with a
closer match?
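A sketch of generating those variants with java.text.Normalizer (standalone code, class and method names are my own): it locates the accented characters, then emits the word with every subset of them stripped.

```java
import java.text.Normalizer;
import java.util.*;

public class AccentVariants {
    // Return the word plus every variant with some subset of its
    // accented characters replaced by their base letters.
    public static Set<String> variants(String word) {
        // Find positions whose character changes when accents are stripped.
        List<Integer> accented = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            String ch = String.valueOf(word.charAt(i));
            String base = Normalizer.normalize(ch, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", "");
            if (!base.equals(ch)) accented.add(i);
        }
        Set<String> out = new LinkedHashSet<>();
        int n = accented.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder(word);
            for (int j = 0; j < n; j++) {
                if ((mask & (1 << j)) != 0) {
                    int pos = accented.get(j);
                    String base = Normalizer.normalize(
                            String.valueOf(word.charAt(pos)),
                            Normalizer.Form.NFD).replaceAll("\\p{M}", "");
                    sb.setCharAt(pos, base.charAt(0));
                }
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

For "préférence" this yields exactly the 4 tokens above; a word with k accented characters yields 2^k tokens, which answers the "too many tokens" question for long heavily-accented words.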


 

-Original Message-
From: Binkley, Peter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 11, 2008 8:37 AM
To: solr-user@lucene.apache.org
Subject: RE: Accented search

We've done this in a pre-Solr Lucene context by using the position
increment: when a token contains accented characters, you add a stripped
version of that token with a zero increment, so that for matching purposes
the original and the stripped version are at the same position. Accents are
not stripped from queries. The effect is that an accented search matches
your Doc A, and an unaccented search matches Docs A and B. We do that after
lower-casing the token.
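Peter's actual Lucene filter isn't shown here; the following is a minimal stdlib model of the idea (class name and token representation are my own). Each token gets a position; when accent-stripping changes a token, the stripped form is emitted at the same position, i.e. with a position increment of zero.

```java
import java.text.Normalizer;
import java.util.*;

public class ZeroIncrementModel {
    // Emit (position, term) pairs: normal tokens advance the position
    // by 1; an accent-stripped variant reuses the previous position,
    // modeling a position increment of 0.
    public static List<String[]> tokenize(String text) {
        List<String[]> tokens = new ArrayList<>();
        int pos = -1;
        for (String word : text.toLowerCase().split("\\s+")) {
            pos++;
            tokens.add(new String[]{String.valueOf(pos), word});
            String stripped = Normalizer.normalize(word, Normalizer.Form.NFD)
                                        .replaceAll("\\p{M}", "");
            if (!stripped.equals(word)) {
                // zero increment: same position as the original token
                tokens.add(new String[]{String.valueOf(pos), stripped});
            }
        }
        return tokens;
    }
}
```

Because both forms occupy one position, phrase queries still line up whether the user types the accented or the stripped spelling.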

There are some limitations: users might come to expect that they can freely
add accents to restrict their search to accented hits, but if they don't
match the accents exactly they won't get any hits. E.g. if a word contains
two accented characters and the user accents only one of them in their
query, they will match neither the accented nor the unaccented version. 

Peter

Peter Binkley
Digital Initiatives Technology Librarian Information Technology Services
4-30 Cameron Library University of Alberta Libraries Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

~ The code is willing, but the data is weak. ~


-Original Message-
From: climbingrose [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2008 10:01 PM
To: solr-user@lucene.apache.org
Subject: Accented search

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher
score than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given higher score.

Any ideas guys? Thanks in advance!

--
Regards,

Cuong Hoang




RE: Color search

2007-09-28 Thread Renaud Waldura
Here's another idea: encode each color mix as one RGB value (32 bits) and
sort according to those values. Finding the closest color is then like
finding the closest point in the color space: effectively a distance search.

70% black #000000 = #000000
20% gray #f0f0f0 = #303030
10% brown #8b4513 = #0e0702
= #3e3732

The distance would be:
sqrt( (r1 - r0)^2 + (g1 - g0)^2 + (b1 - b0)^2 )

Where r0g0b0 is the color the user asked for, and r1g1b1 is the composite
color of the item, calculated above.

--Renaud
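The blend and distance above, as a standalone sketch (class name and the packed-int RGB representation are my own choices):

```java
public class ColorDistance {
    // Blend weighted 0xRRGGBB colors into one composite value,
    // rounding each channel to the nearest integer.
    public static int composite(int[] colors, double[] weights) {
        double r = 0, g = 0, b = 0;
        for (int i = 0; i < colors.length; i++) {
            r += weights[i] * ((colors[i] >> 16) & 0xff);
            g += weights[i] * ((colors[i] >> 8) & 0xff);
            b += weights[i] * (colors[i] & 0xff);
        }
        return ((int) Math.round(r) << 16)
             | ((int) Math.round(g) << 8)
             |  (int) Math.round(b);
    }

    // Euclidean distance between two 0xRRGGBB colors:
    // sqrt((r1-r0)^2 + (g1-g0)^2 + (b1-b0)^2)
    public static double distance(int c0, int c1) {
        int dr = ((c0 >> 16) & 0xff) - ((c1 >> 16) & 0xff);
        int dg = ((c0 >> 8) & 0xff) - ((c1 >> 8) & 0xff);
        int db = (c0 & 0xff) - (c1 & 0xff);
        return Math.sqrt(dr * dr + dg * dg + db * db);
    }
}
```

With the black/gray/brown mix above, `composite(new int[]{0x000000, 0xf0f0f0, 0x8b4513}, new double[]{0.7, 0.2, 0.1})` reproduces the #3e3732 worked out in the message.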


-Original Message-
From: Steven Rowe [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 28, 2007 7:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Color search

Hi Guangwei,

When you index your products, you could have a single color field, and
include duplicates of each color component proportional to its weight.

For example, if you decide to use 10% increments, for your black dress with
70% of black, 20% of gray, 10% of brown, you would index the following terms
for the color field:

  black black black black black black black
  gray gray
  brown

This works because Lucene natively interprets document term frequencies as
weights.

Steve
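A sketch of building that field value (standalone code; class and method names are my own, and it assumes weights come in 10% increments as Steve suggests):

```java
import java.util.*;

public class ColorTerms {
    // Repeat each color name in proportion to its weight, one copy
    // per 10%, so Lucene's term-frequency scoring weights the colors.
    public static String colorField(Map<String, Double> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            int copies = (int) Math.round(e.getValue() * 10);
            for (int i = 0; i < copies; i++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(e.getKey());
            }
        }
        return sb.toString();
    }
}
```

For the 70/20/10 dress this produces the "black black ... gray gray brown" value shown above, ready to index as a single whitespace-tokenized field.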

Guangwei Yuan wrote:
> Hi,
> 
> We're running an e-commerce site that provides product search. We've 
> been able to extract colors from product images, and we think it'd be 
> cool and useful to search products by color. A product image can have 
> up to 5 colors (from a color space of about 100 colors), so we can 
> implement it easily with Solr's facet search (thanks all who've developed
Solr).
> 
> The problem arises when we try to sort the results by the color relevancy.
> What's different from a normal facet search is that colors are 
> weighted. For example, a black dress can have 70% of black, 20% of 
> gray, 10% of brown. A search query "color:black" should return results 
> in which the black dress ranks higher than other products with less
percentage of black.
> 
> My question is: how to configure and index the color field so that 
> products with a higher percentage of color X rank higher for query
"color:X"?
> 
> Thanks for your help!
> 
> - Guangwei




Non-HTTP Indexing

2007-09-06 Thread Renaud Waldura
Dear Solr Users:
 
Is it possible to index documents directly without going through any
XML/HTTP bridge?
I have a large collection (10^7 documents, some very large) and indexing
speed is a concern.
Thanks!
 
--Renaud