Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-12 Thread Thierry Collogne

Ok. Thanks for the clarification. We will do the stripping before the
indexing.

On 11/06/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: Ok. Is it possible to get back the content without the html tags?

Solr never does anything to modify the stored value of a field, so you'd
really need to send Solr the value after stripping the HTML to get this to
work.

Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML
stripping as part of the tokenization process, so there is never a
single markup-free value for the field in Solr.





-Hoss
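
For readers who want to do the same pre-stripping client-side: a minimal
sketch that reuses the HTMLStripReader the tokenizer factory wraps (class
and package as in Solr 1.2; verify against your version -- the class and
method names of the wrapper itself are illustrative):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.solr.analysis.HTMLStripReader;

public class PreStrip {
    // Strip HTML from a value before posting it to Solr, so the
    // *stored* copy of the field is markup-free as well.
    static String stripHtml(String html) throws IOException {
        Reader in = new HTMLStripReader(new StringReader(html));
        StringBuilder out = new StringBuilder();
        char[] buf = new char[1024];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(stripHtml("<p>Hello <b>world</b></p>"));
    }
}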




Re: LIUS/Fulltext indexing

2007-06-12 Thread Bertrand Delacretaz

On 6/12/07, Yonik Seeley [EMAIL PROTECTED] wrote:


... I think Tika will be the way forward (some of the code for Tika is
coming from LIUS)...


Work has indeed started to incorporate the Lius code into Tika, see
https://issues.apache.org/jira/browse/TIKA-7 and
http://incubator.apache.org/projects/tika.html

-Bertrand


storing the document URI in the index

2007-06-12 Thread Ard Schrijvers
Hello,

is it possible to configure Solr to store the document URI in the Lucene index
(the URI is not an XML field, but just the document's location)? Or is
everybody used to storing the contents of a document in the Lucene index
(doesn't this imply a much larger index, though?), so that instead of
retrieving the document's content through a separate fetch over
HTTP/filesystem you just show the result from the stored content field?

Thx in advance for any help,

Regards Ard






Re: storing the document URI in the index

2007-06-12 Thread Erik Hatcher


On Jun 12, 2007, at 8:51 AM, Ard Schrijvers wrote:
is it possible to configure Solr to store the document URI in the
Lucene index (the URI is not an XML field, but just the document's
location)?


Yes.  Set the field to be stored and non-indexed; field type string
is what I use.


Or is everybody used to storing the contents of a document in the
Lucene index (doesn't this imply a much larger index, though?), so
that instead of retrieving the document's content through a separate
fetch over HTTP/filesystem you just show the result from the stored
content field?


This all depends on the needs of your project.  It's perfectly fine to
store the text outside of the index, and that is the way it really
has to be done for very large indexes where as few fields as possible
are stored.


If you're also asking about Solr fetching the remote resource, that
is a different story altogether, and no, it does not do that.  [Though
with the streaming capability you can feed in a document entirely
from a URL, I haven't experimented with that feature yet myself.]


Erik
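
A sketch of the schema.xml declaration Erik describes (the field name is
illustrative):

<!-- stored but not indexed: returned with results, not searchable -->
<field name="uri" type="string" indexed="false" stored="true" />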



RE: storing the document URI in the index

2007-06-12 Thread Ard Schrijvers
Hello Erik, 

thanks for the fast answer (sorry for my mail not indenting, but I must use
webmail :-( ). The problem I am facing is that I do not see Solr storing the
location of the documents it indexed. So, I need to store the location of a
document in a field, but I do not see where Solr would do this. Fetching the
document will be done with the simple Cocoon generator, so that is no problem,
but of course I need the URL/URI to be in the index. I know I need it as an
UN_TOKENIZED, STORED field, but I can see with Luke that the location is not
present in the Lucene index when Solr crawls a directory of XML files.

Regards Ard Schrijvers


Yes.  Set the field to be stored and non-indexed; field type string
is what I use.

[...]






Re: storing the document URI in the index

2007-06-12 Thread Otis Gospodnetic
Ard,

You have to store the URI in a Field yourself.  That means you need to define 
that field in the schema and you have to set its value when adding documents.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
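
A minimal sketch of what Otis describes, with illustrative names: declare
the field in schema.xml, then set it in every add message the indexing
client sends:

<!-- schema.xml -->
<field name="uri" type="string" indexed="false" stored="true" />

<!-- sent by the indexing client -->
<add>
  <doc>
    <field name="id">doc-42</field>
    <field name="uri">http://example.com/docs/doc-42.xml</field>
    <field name="text">body text of the document</field>
  </doc>
</add>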

RE: storing the document URI in the index

2007-06-12 Thread Ard Schrijvers
Hello Otis, 

thanks for the info. Would it be an improvement to be able to specify in
schema.xml whether or not the URI should be stored, in a field whose name
you can also specify in the schema? It may very well be that you do not
own the XML documents you index over HTTP, and at the same time you do
not want to store their contents in the index. Since the URI is known at
indexing time, adding it to the index is trivial.

Regards Ard




You have to store the URI in a Field yourself.  That means you need to define 
that field in the schema and you have to set its value when adding documents.

Otis

indexing documents (or pieces of a document) by access controls

2007-06-12 Thread Nathaniel A. Johnson

Hi all,

Can anyone give me some advice on breaking a document up and indexing it
by access control lists?  What we have are XML documents that are
transformed based on the user viewing them.  Some users might see all of
the document, while others may see a few fields, and yet others see
nothing at all.  The access control list may be a role the user belongs
to, a list of groups, or even a combination of the two.

I can transform the XML to the plain text that I want to index, key
it off of the ACLs, and then pass along a list of ACLs that the user
issuing a query belongs to when searching.  But I guess I'm not really
sure of the best way to do this.

Anyone have any thoughts?

Thanks!
Nate



Re: storing the document URI in the index

2007-06-12 Thread Otis Gospodnetic
I'm afraid I don't understand your question.  Perhaps somebody else does.

Otis


Re: storing the document URI in the index

2007-06-12 Thread Yonik Seeley

On 6/12/07, Ard Schrijvers [EMAIL PROTECTED] wrote:

thanks for the info. Would it be an improvement to be able to specify in
schema.xml whether or not the URI should be stored, in a field whose name
you can also specify in the schema? It may very well be that you do not own
the XML documents you index over HTTP, and at the same time you do not want
to store their contents in the index. Since the URI is known at indexing
time, adding it to the index is trivial.



Think of it a different way... Solr isn't indexing XML documents; it's
simply using XML as a serialization format to pass it the data to
index.  Often, a program is written to read some other data source
(like a database) and send an XML message to Solr to index it (and
hence the XML document only exists for a very brief time).

-Yonik


Re: storing the document URI in the index

2007-06-12 Thread Walter Underwood
Solr doesn't have the URL of the document. The document is given
to Solr in an HTTP POST.

Solr is not a web spider; it is a search web service.

wunder


On 6/12/07 6:23 AM, Ard Schrijvers [EMAIL PROTECTED] wrote:

 Would it be an improvement to be able to specify in schema.xml whether or
 not the URI should be stored, in a field whose name you can also specify
 in the schema? [...]


RE: indexing documents (or pieces of a document) by access controls

2007-06-12 Thread Ard Schrijvers
Hello Nate,

IMHO, you will not be able to do this in Solr unless you accept pretty hard
constraints on your ACLs (I will get back to this in a moment). IMO, it is
not possible to index documents along with ACLs. ACLs can be very
fine-grained, and the thing you describe, ACL-specific parts of a
document... well, I wouldn't know how you would index that. (Imagine you
change the ACL for a specific user: how do you know what to re-index and
what not? Suppose you add a user? I really do not think it is possible with
fine-grained ACLs.)

You should also realize you are trying to find an answer to an extremely
complex problem: authorisation in an index. (I am trying to develop faceted
navigation in combination with authorisation in a Lucene index in
Jackrabbit, but this is not the place to discuss it.)

So, in your case, if you want to use Solr with some form of ACLs, I think
you can only manage this if:

1) your ACLs are some sort of paths in a hierarchy-based structure, where
you index the hierarchical structure along with the content. Then, when
querying, you have to include the folders the user is allowed to see.

2) you keep a bitset per user of which documents are allowed (but you even
have ACLs inside documents). Also, keeping bitsets up to date for many
users is almost impossible, because:
- Lucene document ids may change after merging segments
- updating documents might mean updating many, many bitsets if you have
many users

For these reasons, I do not think you can achieve what you want with solar,
unless you are going to work with something like updating the index and ACL
bitsets once a day.

Regards Ard


Can anyone give me some advice on breaking a document up and indexing it
by access control lists.  What we have are xml documents that are
transformed based on the user viewing it.  Some users might see all of
the document, while other may see a few fields, and yet others see
nothing at all.  The access control lists may be a role the user belongs
to, it may be a list of groups, or even a combination of the two.

I can transform the xml to the plain text that I want to index, and key
it off of the acls and then pass along a list of acls that the user
issuing a query belongs to when searching.  But I guess I'm not really
sure how to do this the best way.

Anyone have any thoughts?

Thanks!
Nate






RE: storing the document URI in the index

2007-06-12 Thread Ard Schrijvers
Thanks Yonik and Walter,

putting it that way, it does make good sense not to store the transient XML
file, which is what it is in most use cases (I was thinking differently
because I do have XML files on the file system or over HTTP, like from a
WebDAV call).

Anyway, thanks for all the answers, and again, sorry for my mails not
indenting properly at the moment; it irritates me as well :-)

Regards Ard







RE: indexing documents (or pieces of a document) by access controls

2007-06-12 Thread Ard Schrijvers
Excuse me, I meant solr of course :-) 

 For these reasons, I do not think you can achieve with solar 


Tomcat: The requested resource (/solr/update) is not available.

2007-06-12 Thread Matt Mitchell

Hi,

I've got an app using Cocoon and Solr, both running through Tomcat.
The post.sh file has been modified to grab local files and send them to
Cocoon (via HTTP); the Solr-fied XML from Cocoon is then sent to the
update URL in Tomcat/Solr. Not sure any of that is relevant though!


I'm running the post.sh file like:

post.sh ../xml/*.xml

Which passes all of the files in ../xml/ to the post.sh script.

Most of the POSTs work fine, but every once in a while I'll get:

The requested resource (/solr/update) is not available.

So my question is this: is there a problem with sending all of those
post requests to Solr all at once? Should I be waiting for an OK
response before posting the next? Or is it OK to just blast Solr like
that? I'm wondering if it's a Tomcat issue?


Matt
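
One way to narrow this down (a sketch, assuming curl as post.sh itself
uses; the port and paths are illustrative) is to post the files one at a
time and check each HTTP status before continuing:

#!/bin/sh
# Post one file per request and stop on the first non-200 status,
# so an intermittent "/solr/update not available" is caught at once.
URL=http://localhost:8080/solr/update
for f in ../xml/*.xml; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H 'Content-Type: text/xml; charset=utf-8' \
      --data-binary @"$f" "$URL")
  echo "$f -> $code"
  [ "$code" = 200 ] || { echo "stopping at $f"; exit 1; }
done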


RE: Multi-language indexing and searching

2007-06-12 Thread Teruhiko Kurosaka
Daniel,
I was reading your email and responses to it with great 
interest.

I was aware that Solr has an implicit assumption that
a field is mono-lingual per system. But your mail and
the responses to it made me wonder whether this limitation
is practical for multi-lingual search applications.  For bi-lingual
or tri-lingual search, we can have parallel fields (title_en,
title_fr, title_de, for example), but this wouldn't scale well.

Assume we are making a search application for a multi-lingual
library at a university in Japan, for example:
the application would have a book title field in Japanese,
perhaps another title field in English for visiting
scholars, and a title field in the original language.
The last field's language would vary among more than 50 modern
languages (and not-so-modern languages like Latin).  Solr
may need some rearchitecting in this area.

I work for a company called Basis Technology
(www.basistech.com), which develops a suite of language
processing software, and I've written a module to integrate
it with Solr (and Lucene in general).  The module is
made of a universal Tokenizer and Analyzers for English and
Japanese, but they can be modified easily to handle any of
the 16 languages we support. (Source code is provided.)

When I was developing this module, I thought of writing
a super Analyzer that automatically detects the language
and does the right thing.  But I've found this won't fit
well with the design of Lucene and Solr.  For one thing,
there is no way to save the detected language in the field
if the language is detected within the Analyzer; Lucene and Solr
require that the language be known before an Analyzer can be
instantiated, and it's the Analyzer that detects the language in my
design.  A second obstacle is that the set of Filters
the Analyzer uses depends on the language, so it must be
dynamically changed. This could be done programmatically, but
it's not easy.  My big hope is that we can work together to
come up with some way for the language detected within
the Analyzer to be retrieved and made into a field.

Anyway, if you are interested in trying my multi-lingual
Analyzers, please contact me by private email.

Regards,
-kuro
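
For comparison, the parallel-fields approach is what plain Lucene's
PerFieldAnalyzerWrapper expresses directly (a sketch against the Lucene 2.x
API; the analyzer choices are illustrative, and the French/German analyzers
live in the contrib analyzers jar):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerLanguageAnalyzers {
    public static Analyzer build() {
        // The language must be known per field name up front --
        // exactly the limitation discussed above.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("title_de", new GermanAnalyzer());
        analyzer.addAnalyzer("title_fr", new FrenchAnalyzer());
        // title_en and anything else falls through to StandardAnalyzer.
        return analyzer;
    }
}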


Re: Multi-language indexing and searching

2007-06-12 Thread Yonik Seeley

On 6/12/07, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:

For bi-lingual
or tri-lingual search, we can have parallel fields (title_en,
title_fr, title_de, for example) but this wouldn't scale well.


Due to search across multiple fields, or due to increased index size?


Lucene and Solr
require that the language be known before an Analyzer can be
instantiated, and it's the Analyzer that detects the language in my
design.  A second obstacle is that the set of Filters
the Analyzer uses depends on the language, so it must be
dynamically changed. This could be done programmatically, but
it's not easy.  My big hope is that we can work together to
come up with some way for the language detected within
the Analyzer to be retrieved and made into a field.


Something could be done for the indexing side of things, but then how
do you query?
Would you be able to do language detection on single word queries, or
do you apply multiple analyzers and query the same field multiple ways
(which seems very close to the multiple field approach)?

Also, would multiple languages in a single field perhaps cause idf skew?

50 languages is a lot... perhaps a simple analyzer that could just try
to break into words and lowercase?


-Yonik
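
For what it's worth, such a fallback analyzer is easy to express in
schema.xml with stock factories (a sketch; the type name is illustrative):

<fieldType name="text_ws_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>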


RE: storing the document URI in the index

2007-06-12 Thread Thorsten Scherler

Hi Ard,

you may want to have a look at 
http://wiki.apache.org/solr/SolrForrest

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: indexing documents (or pieces of a document) by access controls

2007-06-12 Thread Ken Krugler

Can anyone give me some advice on breaking a document up and indexing it
by access control lists? [...]


Given the requirement to break down a document into separately 
controlled pieces, I'd create a servlet that fronts the Solr 
servlet and handles this conversion. I could think of ways to do it 
using Solr, but they feel like unnatural acts.


As a general comment on ACLs, one relatively easy way to handle this 
is via group ids that you use to restrict the query. Each document 
has a groupid field with a list of the group ids that are authorized to 
access it. Each user query is converted into (query) AND (groupid:xx OR 
groupid:yy), where xx/yy (and so on) are the groups that the user 
belongs to.


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
If you can't find it, you can't fix it
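
In Solr, that restriction fits naturally into a filter query appended by
the search front end (a sketch; the group ids are illustrative):

q=user+query&fq=groupid:(12 OR 34)

Using fq keeps the ACL clause out of relevance scoring and lets Solr cache
the group filter independently of the main query.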


Re: indexing documents (or pieces of a document) by access controls

2007-06-12 Thread Daniel Alheiros
Hi

As for the fields, whether or not they are present in the responses based
on the user's group, you can do it in many different ways: using XML
transformation to remove the undesirable fields, implementing your own
RequestHandler able to process your group information and filter the data,
showing only what should be shown to the user, ...

Regards,
Daniel


On 12/6/07 16:14, Ken Krugler [EMAIL PROTECTED] wrote:

 Given the requirement to break down a document into separately
 controlled pieces, I'd create a servlet that fronts the Solr
 servlet and handles this conversion. [...]





Re: LIUS/Fulltext indexing

2007-06-12 Thread Vish D.

Sounds interesting. I can't seem to find any clear dates on the project
website. Do you know? ...V1 shipping date?

Thanks!
On 6/12/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote:


On 6/12/07, Yonik Seeley [EMAIL PROTECTED] wrote:

... I think Tika will be the way forward (some of the code for Tika is
 coming from LIUS)...

Work has indeed started to incorporate the Lius code into Tika, see
https://issues.apache.org/jira/browse/TIKA-7 and
http://incubator.apache.org/projects/tika.html

-Bertrand



Re: LIUS/Fulltext indexing

2007-06-12 Thread Bertrand Delacretaz

On 6/12/07, Vish D. [EMAIL PROTECTED] wrote:

...Sounds interesting. I can't seem to find any clear dates on the project
website. Do you know? ...V1 shipping date?...


Not at the moment; Tika just entered incubation and it's impossible to
predict what will happen.

But help is welcome, of course ;-)

-Bertrand


RE: Multi-language indexing and searching

2007-06-12 Thread Teruhiko Kurosaka
Hi Yonik,
 On 6/12/07, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
  For bi-lingual
  or tri-lingual search, we can have parallel fields (title_en, 
  title_fr, title_de, for example) but this wouldn't scale well.
 
 Due to search across multiple fields, or due to increased index size?

Due to the proliferation of the number of fields.  Say we want
the field title to hold the title of the book in
its original language.  Because Solr has this implicit
assumption of one language per field, we would have to have
the artificial fields title_fr, title_de, title_en, title_es, 
etc. for the number of supported languages, only one of
which has a real value per document.  This sounds silly, doesn't it?



 Something could be done for the indexing side of things, but 
 then how do you query?
 Would you be able to do language detection on single word 
 queries, or do you apply multiple analyzers and query the 
 same field multiple ways (which seems very close to the 
 multiple field approach)?

You are right that language auto-detection does not
work on queries. The search user would have to specify the
language somehow.  One commercial search engine vendor
does this by prefixing a query term with $lang=en .
I would do this with a drop-down list.  Each user or session
would have a configurable default language.



 Also, would multiple languages in a single field perhaps 
 cause idf skew?

Sorry, I don't know enough about the insides of search engines
to discuss this.


 50 languages is a lot... perhaps a simple analyzer that could 
 just try to break into words and lowercase?

This won't work because:
(1) The concept of lowercase doesn't apply to all languages.
(2) Even among languages that use Latin script,
there can be different normalization rules.  For many
European languages, accent marks can be dropped (ü becomes
u), but for German, ü may better be mapped to ue,
which is the alternative spelling of ü in German
writing. 
(3) Some languages, such as Chinese and Japanese, do not
even use spaces or other delimiters to indicate word
boundaries.  Language-specific rules have to be applied
just to extract words from a run of text.

-kuro


RE: question about sorting

2007-06-12 Thread Xuesong Luo
Thanks, Yonik. Unfortunately we have users whose first names contain
more than one word, so it seems copyField is my only option.

Thanks
Xuesong 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, June 12, 2007 10:35 AM
To: solr-user@lucene.apache.org
Subject: Re: question about sorting

On 6/11/07, Xuesong Luo [EMAIL PROTECTED] wrote:
 For example, first name, department, job title etc.

Something like first name might work as a single field that is
searchable and sortable (use a keyword tokenizer followed by a
lowercase filter).  If the field contains multiple words and you want
to both search and sort on that field, there isn't currently a better
alternative to copyField.

-Yonik




Re: question about sorting

2007-06-12 Thread Yonik Seeley

On 6/12/07, Xuesong Luo [EMAIL PROTECTED] wrote:

Thanks, Yonik. Unfortunately we have users whose first names contain
more than one word, so it seems copyField is my only option.


Yes, if you need to be able to match on part of a first name, rather
than just exact first name.

-Yonik


RE: Multi-language indexing and searching

2007-06-12 Thread Chris Hostetter

: Due to the prolification of number of fields.  Say, we want
: to have the field title to have the title of the book in
: its original language.  But because Solr has this implicit
: assumption of one language per field, we would have to have
: the artifitial fields title_fr, title_de, title_en, title_es,
: etc. etc. for the number of supported languages, only one of
: which has a ral value per document.  This sounds silly, doesn't it?

not really, I have indexes with *thousands* of fields ... if you turn
field norms off it's extremely efficient, but even with norms, 50*n fields
(where n is the number of real fields you have: title, author, etc.)
should work fine.

furthermore, declaration of these fields can be simple -- if you have a
language you want to treat specially, then presumably you have a special
analyzer for it.  dynamicFields where the field name is the wildcard
and the language is set can be used to handle all of the different
indexed fields:

<dynamicField name="*_english" type="english" />
<dynamicField name="*_french" type="french" />
<dynamicField name="*_spanish" type="spanish" />
...more like the above for each language you want to support...
<copyField source="*_english" dest="english" />
<copyField source="*_french" dest="french" />
<copyField source="*_spanish" dest="spanish" />
...more like the above for each language you want to support...

and now you can index documents with fields like this...

   author_english = Mr. Chris Hostetter
   author_spanish = Senor Cristobol Hostetter
   body_english = I can't Believe It's not butter
   body_spanish = No puedo creer que no es mantaquea
   title_english = One Man's Disbelief

...and you can search on english:Chris, spanish:Cristobol,
author_spanish:Cristobol, etc...

you could even add dynamicFields with the field name set and the language
wildcarded to handle any fields used solely for display, with even less
declaration (one per field instead of one per language) ...

<dynamicField name="display_title_*" type="string" />
...




-Hoss



Re: To make sure XML is UTF-8

2007-06-12 Thread Ajanta Phatak

Hi

Not sure if you've found a solution to your problem yet, but I dealt 
with a similar issue, described below, and hopefully it'll help 
you too. Of course, this assumes that your original data is in UTF-8 format.


The default charset encoding for MySQL is Latin1, and our display format 
was UTF-8; that was the problem. These are the steps I performed to 
get the search data into UTF-8 format:


Changed my.cnf as follows (though this can be avoided by executing commands 
on every new connection if we don't want the whole DB in UTF-8 format):


Under: [mysqld] added:
# setting default charset to utf-8
collation_server=utf8_unicode_ci
character_set_server=utf8
default-character-set=utf8

Under: [client]
default-character-set=utf8

After changing it, I restarted mysqld, re-created the DB, re-inserted all 
the data using my data-insert code (a Java program), and re-created the 
Solr index. The key is to change the settings for both the mysqld and 
client sections in my.cnf: the mysqld setting makes sure that MySQL 
doesn't convert the data to Latin1 while storing it, and the client 
setting ensures that the data is not converted while being accessed, 
going into or coming out of the server.


Ajanta.
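
A related knob on the client side: if the insert program uses MySQL
Connector/J, the JDBC URL can force UTF-8 explicitly (a sketch; the host,
database, and credentials are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8Connect {
    public static Connection open() throws Exception {
        // useUnicode/characterEncoding make the driver exchange UTF-8
        // with the server regardless of the [client] default in my.cnf.
        return DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb"
                + "?useUnicode=true&characterEncoding=UTF-8",
            "user", "password");
    }
}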


Tiong Jeffrey wrote:

Ya, you are right! After I changed it to UTF-8 the error is still there... I
looked at the log, and this is what appears:

127.0.0.1 -  -  [10/06/2007:03:52:06 +] POST /solr/update 
HTTP/1.1 500

4022

I tried to search but couldn't understand what this error is; does anybody
have any idea?

Thanks!!!

On 6/10/07, Chris Hostetter [EMAIL PROTECTED] wrote:


: way during indexing is - FATAL: Connection error (is Solr running at
: http://localhost/solr/update
: ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
: http://local/solr/update;
: 4. Although the error code doesn't specify an XML UTF-8 encoding error, I did
: a bit of research, and looking at the XML file that I have, it doesn't
: fulfill the UTF-8 encoding

I *strongly* encourage you to look at the body of the response and/or the
error log of your Servlet container and find out *exactly* what the cause
of the error is ... you could spend a lot of time working on this and
discover it's not your real problem.



-Hoss





Re: LIUS/Fulltext indexing

2007-06-12 Thread Vish D.

Wonder if TOM could be useful to integrate?
http://tom.library.upenn.edu/convert/sofar.html

On 6/12/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote:


On 6/12/07, Vish D. [EMAIL PROTECTED] wrote:
 ...Sounds interesting. I can't seem to find any clear dates on the
project
 website. Do you know? ...V1 shipping date?...

Not at the moment, Tika just entered incubation and it's impossible to
predict what will happen.

But help is welcome, of course ;-)

-Bertrand



Re: To make sure XML is UTF-8

2007-06-12 Thread Tiong Jeffrey

Hi Ajanta,

thanks! Since I use PHP, I managed to use a PHP decode function to change
it to UTF-8.

But just a question: even if we change MySQL's default charset to UTF-8,
if the input is originally in another format, the MySQL engine won't help
convert it to UTF-8, right? I think my question is: what is the use of
defining the charset in MySQL other than for labeling purposes?

Thanks!

Jeffrey

On 6/13/07, Ajanta Phatak [EMAIL PROTECTED] wrote:


[...]




Compass vs Solr

2007-06-12 Thread Harini Raghavan

Hi Everyone,

We have a web application with search functionality built using Lucene. The
search is across different types of data, so it does not scale well from the
database. As Lucene does not allow storing relational data, we decided to
try out Compass, since it provides an object/relational mapping onto the
Lucene index.

We have had good results with Compass compared to the database search.
But before we migrate all the other search workflows to use Compass, we
are trying to evaluate Solr. We will need to scale our application, as our
data is increasing by the day.

Can anyone suggest which one would perform/scale better, Compass or Solr?

OR

Has anyone tried to use a combination of Compass & Solr?

Any suggestion would be appreciated.
Thanks,
Harini