Re: Restrict access to localhost

2010-12-02 Thread Peter Karich
for 1) use the Tomcat configuration in conf/server.xml:
<Connector address="127.0.0.1" port="8080" ... />
for 2) if they have direct access to Solr, either insert a middleware
layer or create a write lock ;-)
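
A minimal sketch of such a middleware layer for 2), assuming Solr runs in a
servlet container where a filter can be mapped in front of the Solr webapp
(the class name and the matched path below are only illustrative):

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Rejects anything that looks like a write (e.g. /update) so clients can only search.
public class ReadOnlyFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        if (http.getRequestURI().contains("/update")) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN,
                    "write access is disabled");
        } else {
            chain.doFilter(req, res);
        }
    }
}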



Hello all,

1)
I want to restrict access to Solr to localhost only. How do I achieve that?

2)
If I want to allow the clients to search but not to delete, how do I restrict the 
access?

Any thoughts?

Regards
Ganesh.




--
http://jetwick.com twitter search prototype



SOLR Thesaurus

2010-12-02 Thread lee carroll
Hi List,

Coming to an end of a prototype evaluation of SOLR (all very good etc etc).
Getting to the point of looking at bells and whistles. Does SOLR have a
thesaurus? Can't find any reference
to one in the docs and on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesaurus.)

I mean something like:

PT: 
BT: xxx,,
NT: xxx,,
RT:xxx,xxx,xxx
Scope Note: xx,

Like I say, bells and whistles

cheers Lee


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
The Win7 crashes aren't from disk drivers - they come from, in this
case, a Broadcom wireless adapter driver.
The corruption comes as a result of the 'hard stop' of Windows.

I would imagine this same problem could/would occur on any OS if the
plug was pulled from the machine.

Thanks,
Peter


On Thu, Dec 2, 2010 at 4:07 AM, Lance Norskog goks...@gmail.com wrote:
 Is there any way that Windows 7 and disk drivers are not honoring the
 fsync() calls? That would cause files and/or blocks to get saved out
 of order.

 On Tue, Nov 30, 2010 at 3:24 PM, Peter Sturge peter.stu...@gmail.com wrote:
 After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
 LockObtainFailedException errors: (excerpt)

   30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
   SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
 obtain timed out:
 nativefsl...@solr\.\.\data0\index\lucene-ad25f73e3c87e6f192c4421756925f47-write.lock


 When I run CheckIndex, I get: (excerpt)

  30 of 30: name=_2fi docCount=857
    compound=false
    hasProx=true
    numFiles=8
    size (MB)=0.769
    diagnostics = {os.version=6.1, os=Windows 7, lucene.version=3.1-dev 
 ${svnver
 sion} - 2010-09-11 11:09:06, source=flush, os.arch=amd64, 
 java.version=1.6.0_18,
 java.vendor=Sun Microsystems Inc.}
    no deletions
    test: open reader.FAILED
    WARNING: fixIndex() would remove reference to this segment; full 
 exception:
 org.apache.lucene.index.CorruptIndexException: did not read all bytes from 
 file
 _2fi.fnm: read 1 vs size 512
        at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:367)
        at org.apache.lucene.index.FieldInfos.init(FieldInfos.java:71)
        at 
 org.apache.lucene.index.SegmentReader$CoreReaders.init(SegmentReade
 r.java:119)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:583)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:561)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:467)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:878)

 WARNING: 1 broken segments (containing 857 documents) detected


 This seems to happen every time Windows 7 crashes, and it would seem
 extraordinarily bad luck for this tiny test index to be in the middle of
 a commit every time.
 (it is set to commit every 40secs, but for such a small index it only
 takes millis to complete)

 Does this seem right? I don't remember seeing so many corruptions in
 the index - maybe it is the world of Win7 dodgy drivers, but it would
 be worth investigating if there's something amiss in Solr/Lucene when
 things go down unexpectedly...

 Thanks,
 Peter


 On Tue, Nov 30, 2010 at 9:19 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The index itself isn't corrupt - just one of the segment files. This
 means you can read the index (less the offending segment(s)), but once
 this happens it's no longer possible to
 access the documents that were in that segment (they're gone forever),
 nor write/commit to the index (depending on the env/request, you get
 'Error reading from index file..' and/or WriteLockError)
 (note that for my use case, documents are dynamically created so can't
 be re-indexed).

 Restarting Solr fixes the write lock errors (an indirect environmental
 symptom of the problem), and running CheckIndex -fix is the only way
 I've found to repair the index so it can be written to (rewrites the
 corrupted segment(s)).

 I guess I was wondering if there's a mechanism that would support
 something akin to a transactional rollback for segments.

 Thanks,
 Peter



 On Mon, Nov 29, 2010 at 5:33 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com 
 wrote:
 If a Solr index is running at the time of a system halt, this can
 often corrupt a segments file, requiring the index to be -fix'ed by
 rewriting the offending file.

 Really?  That shouldn't be possible (if you mean the index is truly
 corrupt - i.e. you can't open it).

 -Yonik
 http://www.lucidimagination.com






 --
 Lance Norskog
 goks...@gmail.com



Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The Win7 crashes aren't from disk drivers - they come from, in this
 case, a Broadcom wireless adapter driver.
 The corruption comes as a result of the 'hard stop' of Windows.

 I would imagine this same problem could/would occur on any OS if the
 plug was pulled from the machine.

Actually, Lucene should be robust to this -- losing power, OS crash,
hardware failure (as long as the failure doesn't flip bits), etc.
This is because we do not delete files associated with an old commit
point until all files referenced by the new commit point are
successfully fsync'd.

However it sounds like something is wrong, at least on Windows 7.

I suspect it may be how we do the fsync -- if you look in
FSDirectory.fsync, you'll see that we take a String fileName in.  We
then open a new read/write RandomAccessFile, and call its
.getFD().sync().
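
For illustration, that pattern amounts to something like the following (a
minimal sketch, not the actual FSDirectory source; directory and file name are
placeholders):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sync a file by name: re-open it and sync the *new* descriptor, even though
// all the writes happened through an earlier, now-closed descriptor.
public final class SyncByName {
    public static void fsync(File dir, String fileName) throws IOException {
        RandomAccessFile file = new RandomAccessFile(new File(dir, fileName), "rw");
        try {
            file.getFD().sync();
        } finally {
            file.close();
        }
    }
}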

I think this is potentially risky, ie, it would be better if we called
.sync() on the original file we had opened for writing and written
lots of data to, before closing it, instead of closing it, opening a
new FileDescriptor, and calling sync on it.  We could conceivably take
this approach, entirely in the Directory impl, by keeping the pool of
file handles for write open even after .close() was called.  When a
file is deleted we'd remove it from that pool, and when it's finally
sync'd we'd then sync it and remove it from the pool.

Could it be that on Windows 7 the way we fsync (opening a new
FileDescriptor long after the first one was closed) doesn't in fact
work?

Mike


Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Peter Sturge
As I'm not familiar with the syncing in Lucene, I couldn't say whether
there's a specific problem with regard to Win7/2008 Server etc.

Windows has long had the somewhat odd behaviour of deliberately
caching file handles after an explicit close(). This has been part of
NTFS since NT 4 days, but there may be some new behaviour introduced
in Windows 6.x (and there is a lot of new behaviour) that causes an
issue. I have also seen this problem in Windows Server 2008 (server
version of Win7 - same file system).

I'll try some further testing on previous Windows versions, but I've
not previously come across a single segment corruption on Win 2k3/XP
after hard failures. In fact, it was when I first encountered this
problem on Server 2008 that I even discovered CheckIndex existed!

I guess a good question for the community is: Has anyone else
seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
Win7)?

Mike, are there any diagnostics/config etc. that I could try to help
isolate the problem?

Many thanks,
Peter



On Thu, Dec 2, 2010 at 9:28 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, Dec 2, 2010 at 4:10 AM, Peter Sturge peter.stu...@gmail.com wrote:
 The Win7 crashes aren't from disk drivers - they come from, in this
 case, a Broadcom wireless adapter driver.
 The corruption comes as a result of the 'hard stop' of Windows.

 I would imagine this same problem could/would occur on any OS if the
 plug was pulled from the machine.

 Actually, Lucene should be robust to this -- losing power, OS crash,
 hardware failure (as long as the failure doesn't flip bits), etc.
 This is because we do not delete files associated with an old commit
 point until all files referenced by the new commit point are
 successfully fsync'd.

 However it sounds like something is wrong, at least on Windows 7.

 I suspect it may be how we do the fsync -- if you look in
 FSDirectory.fsync, you'll see that we take a String fileName in.  We
 then open a new read/write RandomAccessFile, and call its
 .getFD().sync().

 I think this is potentially risky, ie, it would be better if we called
 .sync() on the original file we had opened for writing and written
 lots of data to, before closing it, instead of closing it, opening a
 new FileDescriptor, and calling sync on it.  We could conceivably take
 this approach, entirely in the Directory impl, by keeping the pool of
 file handles for write open even after .close() was called.  When a
 file is deleted we'd remove it from that pool, and when it's finally
 sync'd we'd then sync it and remove it from the pool.

 Could it be that on Windows 7 the way we fsync (opening a new
 FileDescriptor long after the first one was closed) doesn't in fact
 work?

 Mike



Re: Best practice for Delta every 2 Minutes.

2010-12-02 Thread stockii

At the moment no OOM occurs, but we are not on the real live system yet ...

I thought maybe I would get this problem ...

We are running seven cores and each needs to be updated very quickly. Only one core
has a huge index with 28M docs. Maybe it makes sense for the future to use
Solr with replication!? Or can I run two instances, one for searching and one
for updating? Or is there a danger of corrupt indexes?


Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
Hi,

we have a serious harddisk problem, and it's definitely related to a 
full-import from a relational
database into a solr index.

The first time it happened on our development server, where the raidcontroller 
crashed during a full-import
of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2 of 
the harddisks where the solr
index files are located stopped working (we needed to replace them).

After the crash of the raid controller, we decided to move the development of 
solr/index related stuff to our
local development machines. 

Yesterday i was running another full-import of ~10 Million documents on my 
local development machine, 
and during the import, a harddisk failure occurred. Since this failure, my 
harddisk activity seems to 
be around 100% all the time, even if no solr server is running at all. 

I've been googling the last 2 days to find some info about solr related 
harddisk problems, but i didn't find anything
useful.

Are there any steps we need to take care of in respect to harddisk failures 
when doing a full-import? Right now,
our steps look like this:

1. Delete the current index
2. Restart solr, to load the updated schemas
3. Start the full import

Initially, the solr index and the relational database were located on the same 
harddisk. After the crash, we moved
the index to a separate harddisk, but nevertheless this harddisk crashed too.

I'd really appreciate any hints on what we might do wrong when importing data, 
as we can't release this
on our production servers when there's the risk of harddisk failures.


thanks.


-robert







Re: Preventing index segment corruption when windows crashes

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:53 AM, Peter Sturge peter.stu...@gmail.com wrote:
 As I'm not familiar with the syncing in Lucene, I couldn't say whether
 there's a specific problem with regards Win7/2008 server etc.

 Windows has long had the somewhat odd behaviour of deliberately
 caching file handles after an explicit close(). This has been part of
 NTFS since NT 4 days, but there may be some new behaviour introduced
 in Windows 6.x (and there is a lot of new behaviour) that causes an
 issue. I have also seen this problem in Windows Server 2008 (server
 version of Win7 - same file system).

 I'll try some further testing on previous Windows versions, but I've
 not previously come across a single segment corruption on Win 2k3/XP
 after hard failures. In fact, it was when I first encountered this
 problem on Server 2008 that I even discovered CheckIndex existed!

 I guess a good question for the community is: Has anyone else
 seen/reproduced this problem on Windows 6.x (i.e. Server 2008 or
 Win7)?

 Mike, are there any diagnostics/config etc. that I could try to help
 isolate the problem?

Actually it might be easiest to make a standalone Java test, maybe
using Lucene's FSDir, that opens files in sequence (0.bin, 1.bin,
2.bin...), writes verifiable content to them (eg random bytes from a fixed seed)
and then closes & syncs each one.  Then, crash the box while this is
running.  Finally, run a verify step that checks that the data is
correct?  Ie that our attempt to fsync worked?
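
A rough sketch of such a writer might look like this (file size, names and the
seed are arbitrary; run it, pull the plug, then re-run the same seeded
generator in a verify pass and compare against the files on disk):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

// Writes 0.bin, 1.bin, ... with reproducible random bytes, syncing each one,
// until the machine is crashed externally.
public class FsyncCrashTest {
    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0] : "fsync-test");
        dir.mkdirs();
        Random random = new Random(42);   // fixed seed => content is verifiable later
        for (int i = 0; ; i++) {
            byte[] data = new byte[64 * 1024];
            random.nextBytes(data);
            RandomAccessFile file = new RandomAccessFile(new File(dir, i + ".bin"), "rw");
            file.write(data);
            file.getFD().sync();          // sync before close
            file.close();
            System.out.println("wrote and synced " + i + ".bin");
        }
    }
}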

It could very well be that windows 6.x is now smarter about fsync in
that it only syncs bytes actually written with the currently open file
descriptor, and not bytes written against the same file by past file
descriptors (ie via a global buffer cache, like Linux).

Mike


Re: Troubles with forming query for solr.

2010-12-02 Thread Savvas-Andreas Moysidis
Hello,

would something along these lines:
(field1:term AND field2:term AND field3:term)^2 OR (field1:term AND
field2:term)^0.8 OR (field2:term AND field3:term)^0.5

work? You'll probably need to experiment with the boost values to get the
desired result.

Another option could be investigating the Dismax handler.

On 1 December 2010 02:38, kolesman alekkolesni...@gmail.com wrote:


 Hi,

 I have some troubles with forming query for solr.

 Here is my task :
 I'm indexing objects with 3 fields, for example {field1, field2, field3}.
 In solr's response I want to get objects in a special order:
 1. Firstly I want to get objects where all 3 fields are matched
 2. Then I want to get objects where ONLY field1 and field2 are matched
 3. And finally I want to get objects where ONLY field2 and field3 are
 matched.

 Could your explain me how to form query for my task?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Troubles-with-forming-query-for-solr-tp1996630p1996630.html
 Sent from the Solr - User mailing list archive at Nabble.com.



RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I know the doc ids from one core have nothing to do with the other. I was going
to use the docId returned from the first core in the solr results and store it
in the second core; that way the second core knows about the doc ids from the
first core. So when you query the second core from the Filter in the first core,
you get returned a set of data that includes the docId from the first core that
the document relates to.

I have backed off from this approach and now have a user-defined primary key in
the firstCore, which is stored as the reference in the secondCore; when the
filter performs the search it goes off and queries the firstCore for each
primary key and gets the lucene docId from the returned doc.

Thanks,
Steve

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 02:19
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

On the face of it, this doesn't make sense, so perhaps you can explain a
bit. The doc IDs
from one Solr instance have no relation to the doc IDs from another Solr
instance. So anything
that uses doc IDs from one Solr instance to create a filter on another
instance doesn't seem
to be something you'd want to do...

Which may just mean I don't understand what you're trying to do. Can you
back up a bit
and describe the higher-level problem? This seems like it may be an XY
problem, see:
http://people.apache.org/~hossman/#xyproblem

Best
Erick

On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 Hi,

 I was wondering how I would go about getting the lucene docid included in
 the results from a solr query?

 I've built a QueryParser to query another solr instance and and join the
 results of the two instances through the use of a Filter.  The Filter needs
 the lucene docid to work. This is the only bit I'm missing right now.

 Thanks,
 Steve




Re: Tuning Solr caches with high commit rates (NRT)

2010-12-02 Thread stockii

Great thread and exactly my problems :D

I set up two Solr instances, one for updating the index and another for
searching.

When I perform an update, the search instance doesn't get the new documents.
When I start a commit on the searcher, it finds them. How can I tell the searcher
to not always look only at the old index? Automatic refresh? XD


Re: Tuning Solr caches with high commit rates (NRT)

2010-12-02 Thread Peter Sturge
In order for the 'read-only' instance to see any new/updated
documents, it needs to do a commit (since it's read-only, it is a
commit of 0 documents).
You can do this via a client service that issues periodic commits, or
use autorefresh from within solrconfig.xml. Be careful that you don't
do anything in the read-only instance that will change the underlying
index - like optimize.
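
A minimal sketch of such a client service using SolrJ (the URL and the interval
are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Periodically issues an empty commit against the read-only search instance
// so it reopens its searcher and sees documents written by the update instance.
public class PeriodicCommitter {
    public static void main(String[] args) throws Exception {
        SolrServer searchInstance = new CommonsHttpSolrServer("http://localhost:8983/solr");
        while (true) {
            searchInstance.commit();     // commit of 0 documents: just reopens the view
            Thread.sleep(60 * 1000L);    // e.g. once a minute
        }
    }
}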

Peter


On Thu, Dec 2, 2010 at 12:51 PM, stockii st...@shopgate.com wrote:

 great thread and exactly my problems :D

 i set up two solr-instances, one for update the index and another for
 searching.

 When i perform an update. the search-instance dont get the new documents.
 when i start a commit on searcher he found it. how can i say the searcher
 that he alwas look not only the old index. automatic refresh ? XD
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Tuning-Solr-caches-with-high-commit-rates-NRT-tp1461275p2005738.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Best practice for Delta every 2 Minutes.

2010-12-02 Thread Erick Erickson
In fact, having a master/slave where the master is the
indexing/updating machine and the slave(s) are searchers
is one of the recommended configurations. The replication
is used in many, many sites so it's pretty solid.

It's generally not recommended, though, to run separate
instances on the *same* server. No matter how many
cores/instances/etc, you're still running on the same
physical hardware so I/O contention, memory issues, etc
are still bounded by your hardware

Best
Erick

On Thu, Dec 2, 2010 at 5:12 AM, stockii st...@shopgate.com wrote:


 at the time no OOM occurs. but we are not in correct live system ...

 i thougt maybe i get this problem ...

 we are running seven cores and each want be update very fast. only one core
 have a huge index with 28M docs. maybe it makes sense for the future to use
 solr with replication !? or can i runs two instances, one for search and
 one
 for updating ? or is there the danger of corrupt indizes ?
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Best-practice-for-Delta-every-2-Minutes-tp1992714p2005108.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dataimport destroys our harddisks

2010-12-02 Thread Erick Erickson
The very first thing I'd ask is how much free space is on your disk
when this occurs? Is it possible that you're simply filling up your
disk?

do note that an optimize may require up to 2X the size of your index
if/when it occurs. Are you sure you aren't optimizing as you add
items to your index?

But I've never heard of Solr causing hard disk crashes, it doesn't do
anything special but read/write...

Best
Erick

2010/12/2 Robert Gründler rob...@dubture.com

 Hi,

 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.

 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).

 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.

 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.

 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.

 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:

 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import

 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.

 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.


 thanks.


 -robert








Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
Sounds good, especially because your old scenario was fragile. The doc IDs in
your first core could change as a result of a single doc deletion and
optimize. So the doc IDs stored in the second core would then be wrong...

Your user-defined unique key is definitely a better way to go. There are
some tricks you could try if there are performance issues...

Best
Erick

On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 I know the doc ids from one core have nothing to do with the other. I was
 going to use the docId returned from the first core in the solr results and
 store it in the second core that way the second core knows about the doc ids
 from the first core. So when you query the second core from the Filter in
 the first core you get returned a set of data that includes the docId from
 the first core that the document relates to.

 I have backed off from this approach and have a user defined primary key in
 the firstCore, which is stored as the reference in the secondCore and when
 the filter performs the search it goes off and queries the firstCore for
 each primary key and gets the lucene docId from the returned doc.

 Thanks,
 Steve

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 02 December 2010 02:19
 To: solr-user@lucene.apache.org
 Subject: Re: Return Lucene DocId in Solr Results

 On the face of it, this doesn't make sense, so perhaps you can explain a
 bit.The doc IDs
 from one Solr instance have no relation to the doc IDs from another Solr
 instance. So anything
 that uses doc IDs from one Solr instance to create a filter on another
 instance doesn't seem
 to be something you'd want to do...

 Which may just mean I don't understand what you're trying to do. Can you
 back up a bit
 and describe the higher-level problem? This seems like it may be an XY
 problem, see:
 http://people.apache.org/~hossman/#xyproblem

 Best
 Erick

 On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
 steven.lohr...@hmhpub.comwrote:

  Hi,
 
  I was wondering how I would go about getting the lucene docid included in
  the results from a solr query?
 
  I've built a QueryParser to query another solr instance and and join the
  results of the two instances through the use of a Filter.  The Filter
 needs
  the lucene docid to work. This is the only bit I'm missing right now.
 
  Thanks,
  Steve
 
 



RE: SOLR Thesaurus

2010-12-02 Thread Steven A Rowe
Hi Lee,

Can you describe your thesaurus format (it's not exactly self-descriptive) and 
how you would like it to be applied?

I gather you're referring to a thesaurus feature in another product (or product 
class)?  Maybe if you describe that it would help too.

Steve

 -Original Message-
 From: lee carroll [mailto:lee.a.carr...@googlemail.com]
 Sent: Thursday, December 02, 2010 3:56 AM
 To: solr-user@lucene.apache.org
 Subject: SOLR Thesaurus
 
 Hi List,
 
 Coming to and end of a proto type evaluation of SOLR (all very good etc
 etc)
 Getting to the point at looking at bells and whistles. Does SOLR have a
 thesuarus. Cant find any refrerence
 to one in the docs and on the wiki etc. (Apart from a few mail threads
 which
 describe the synonym.txt as a thesuarus)
 
 I mean something like:
 
 PT: 
 BT: xxx,,
 NT: xxx,,
 RT:xxx,xxx,xxx
 Scope Note: xx,
 
 Like i say bells and whistles
 
 cheers Lee


RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I would be interested in hearing about some ways to improve the algorithm. I 
have done a very straightforward Lucene query within a loop to get the docIds.

Here's what I did to get it working where favsBean are objects returned from a 
query of the second core, but there is probably a better way to do it:

private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List<Favorites> favsBeans)
        throws ParseException {
    // open the core & get data directory
    String indexDir = req.getCore().getIndexDir();
    FSDirectory index = null;
    try {
        index = FSDirectory.open(new File(indexDir));
    } catch (IOException e) {
        throw new ParseException("IOException, cannot open the index at: " + indexDir + " " + e.getMessage());
    }

    int[] docIds = new int[favsBeans.size()];
    int i = 0;
    for (Favorites favBean : favsBeans) {
        String pkQueryString = "resourceId:" + favBean.getResourceId();
        Query pkQuery = new QueryParser(Version.LUCENE_CURRENT, "resourceId",
                new StandardAnalyzer()).parse(pkQueryString);

        IndexSearcher searcher = null;
        TopScoreDocCollector collector = null;
        try {
            searcher = new IndexSearcher(index, true);
            collector = TopScoreDocCollector.create(1, true);
            searcher.search(pkQuery, collector);
        } catch (IOException e) {
            throw new ParseException("IOException, cannot search the index at: " + indexDir + " " + e.getMessage());
        }

        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        if (hits != null && hits.length > 0 && hits[0] != null) {
            docIds[i] = hits[0].doc;
            i++;
        }
    }

    Arrays.sort(docIds);
    return docIds;
}

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 13:46
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

Sounds good, especially because your old scenario was fragile. The doc IDs
in
your first core could change as a result of a single doc deletion and
optimize. So
the doc IDs stored in the second core would then be wrong...

Your user-defined unique key is definitely a better way to go. There are
some tricks
you could try if there are performance issues

Best
Erick

On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 I know the doc ids from one core have nothing to do with the other. I was
 going to use the docId returned from the first core in the solr results and
 store it in the second core that way the second core knows about the doc ids
 from the first core. So when you query the second core from the Filter in
 the first core you get returned a set of data that includes the docId from
 the first core that the document relates to.

 I have backed off from this approach and have a user defined primary key in
 the firstCore, which is stored as the reference in the secondCore and when
 the filter performs the search it goes off and queries the firstCore for
 each primary key and gets the lucene docId from the returned doc.

 Thanks,
 Steve

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 02 December 2010 02:19
 To: solr-user@lucene.apache.org
 Subject: Re: Return Lucene DocId in Solr Results

 On the face of it, this doesn't make sense, so perhaps you can explain a
 bit.The doc IDs
 from one Solr instance have no relation to the doc IDs from another Solr
 instance. So anything
 that uses doc IDs from one Solr instance to create a filter on another
 instance doesn't seem
 to be something you'd want to do...

 Which may just mean I don't understand what you're trying to do. Can you
 back up a bit
 and describe the higher-level problem? This seems like it may be an XY
 problem, see:
 http://people.apache.org/~hossman/#xyproblem

 Best
 Erick

 On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
 steven.lohr...@hmhpub.comwrote:

  Hi,
 
  I was wondering how I would go about getting the lucene docid included in
  the results from a solr query?
 
  I've built a QueryParser to query another solr instance and and join the
  results of the two instances through the use of a Filter.  The Filter
 needs
  the lucene docid to work. This is the only bit I'm missing right now.
 
  Thanks,
  Steve
 
 



Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
Ahhh, you're already down in Lucene. That makes things easier...

See TermDocs. Particularly seek(Term). That'll directly access the indexed
unique key rather than having to form a bunch of queries.
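
Something along these lines, perhaps (a sketch only; the field name and the
return convention are just examples). Note that next() has to be called before
doc() returns anything:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Looks up the internal doc id for a unique-key value via TermDocs.
public final class DocIdLookup {
    public static int lookup(IndexReader reader, String resourceId) throws IOException {
        TermDocs termDocs = reader.termDocs(new Term("resourceId", resourceId));
        try {
            return termDocs.next() ? termDocs.doc() : -1;  // -1 if the key isn't indexed
        } finally {
            termDocs.close();
        }
    }
}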

Best
Erick


On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 I would be interested in hearing about some ways to improve the algorithm.
 I have done a very straightforward Lucene query within a loop to get the
 docIds.

 Here's what I did to get it working where favsBean are objects returned
 from a query of the second core, but there is probably a better way to do
 it:

 private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, ListFavorites
 favsBeans) throws ParseException {
// open the core  get data directory
String indexDir = req.getCore().getIndexDir();
FSDirectory index = null;
try {
index = FSDirectory.open(new File(indexDir));
} catch (IOException e) {
throw new ParseException(IOException, cannot open the index at:
  + indexDir +   + e.getMessage());
}

int[] docIds = new int[favsBeans.size()];
int i = 0;
for(Favorites favBean: favsBeans) {
String pkQueryString = resourceId: + favBean.getResourceId();
Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
 resourceId, new StandardAnalyzer()).parse(pkQueryString);

IndexSearcher searcher = null;
TopScoreDocCollector collector = null;
try {
searcher = new IndexSearcher(index, true);
collector = TopScoreDocCollector.create(1, true);
searcher.search(pkQuery, collector);
} catch (IOException e) {
throw new ParseException(IOException, cannot search the
 index at:  + indexDir +   + e.getMessage());
}

ScoreDoc[] hits = collector.topDocs().scoreDocs;
if(hits != null  hits[0] != null) {
docIds[i] = hits[0].doc;
i++;
}
}

Arrays.sort(docIds);
return docIds;
 }

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 02 December 2010 13:46
 To: solr-user@lucene.apache.org
 Subject: Re: Return Lucene DocId in Solr Results

 Sounds good, especially because your old scenario was fragile. The doc IDs
 in
 your first core could change as a result of a single doc deletion and
 optimize. So
 the doc IDs stored in the second core would then be wrong...

 Your user-defined unique key is definitely a better way to go. There are
 some tricks
 you could try if there are performance issues

 Best
 Erick

 On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, Steven
 steven.lohr...@hmhpub.comwrote:

  I know the doc ids from one core have nothing to do with the other. I was
  going to use the docId returned from the first core in the solr results
 and
  store it in the second core that way the second core knows about the doc
 ids
  from the first core. So when you query the second core from the Filter in
  the first core you get returned a set of data that includes the docId
 from
  the first core that the document relates to.
 
  I have backed off from this approach and have a user defined primary key
 in
  the firstCore, which is stored as the reference in the secondCore and
 when
  the filter performs the search it goes off and queries the firstCore for
  each primary key and gets the lucene docId from the returned doc.
 
  Thanks,
  Steve
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: 02 December 2010 02:19
  To: solr-user@lucene.apache.org
  Subject: Re: Return Lucene DocId in Solr Results
 
  On the face of it, this doesn't make sense, so perhaps you can explain a
  bit.The doc IDs
  from one Solr instance have no relation to the doc IDs from another Solr
  instance. So anything
  that uses doc IDs from one Solr instance to create a filter on another
  instance doesn't seem
  to be something you'd want to do...
 
  Which may just mean I don't understand what you're trying to do. Can you
  back up a bit
  and describe the higher-level problem? This seems like it may be an XY
  problem, see:
  http://people.apache.org/~hossman/#xyproblem
 
  Best
  Erick
 
  On Tue, Nov 30, 2010 at 6:57 AM, Lohrenz, Steven
  steven.lohr...@hmhpub.comwrote:
 
   Hi,
  
   I was wondering how I would go about getting the lucene docid included
 in
   the results from a solr query?
  
   I've built a QueryParser to query another solr instance and and join
 the
   results of the two instances through the use of a Filter.  The Filter
  needs
   the lucene docid to work. This is the only bit I'm missing right now.
  
   Thanks,
   Steve
  
  
 



Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?

No, I've checked that already. All disks have plenty of space (they have
a capacity of 2TB, and are currently filled up to 20%).

 
 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?
 

Index size is not a problem in our case. Our index is currently about 3GB.

What do you mean by "optimizing as you add items to your index"?

 But I've never heard of Solr causing hard disk crashes,

Neither did we, and Google is of the same opinion.

One thing that I've found is the mergeFactor value:

http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor

Our sysadmin speculates that maybe the chunk size of our raid/harddisks
and the segment size of the lucene index do not play well together.

Does the lucene segment size affect how the data is written to the disk?


thanks for your help.


-robert







 
 Best
 Erick
 
 2010/12/2 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.
 
 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).
 
 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.
 
 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.
 
 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.
 
 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:
 
 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import
 
 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.
 
 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.
 
 
 thanks.
 
 
 -robert
 
 
 
 
 
 



Re: Dataimport destroys our harddisks

2010-12-02 Thread Sven Almgren
What RAID controller do you use, and what kernel version? (Assuming
Linux.) We had problems during high load with a 3Ware RAID controller
and the current kernel for Ubuntu 10.04; we had to downgrade the
kernel...

The problem was a bug in the driver that only showed up with very high
disk load (as is the case when doing imports)

/Sven

2010/12/2 Robert Gründler rob...@dubture.com:
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?

 no, i've checked that already. all disks have plenty of space (they have
 a capacity of 2TB, and are currently filled up to 20%.


 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?


 index size is not a problem in our case. Our index currently has about 3GB.

 What do you mean with optimizing as you add items to your index?

 But I've never heard of Solr causing hard disk crashes,

 neither did we, and google is the same opinion.

 One thing that i've found is the mergeFactor value:

 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor

 Our sysadmin speculates that maybe the chunk size of our raid/harddisks
 and the segment size of the lucene index does not play well together.

 Does the lucene segment size affect how the data is written to the disk?


 thanks for your help.


 -robert








 Best
 Erick

 2010/12/2 Robert Gründler rob...@dubture.com

 Hi,

 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.

 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).

 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.

 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.

 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.

 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:

 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import

 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.

 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.


 thanks.


 -robert










Re: Dataimport destroys our harddisks

2010-12-02 Thread Robert Gründler
On Dec 2, 2010, at 15:43 , Sven Almgren wrote:

 What Raid controller do you use, and what kernel version? (Assuming
 Linux). We hade problems during high load with a 3Ware raid controller
 and the current kernel for Ubuntu 10.04, we hade to downgrade the
 kernel...
 
 The problem was a bug in the driver that only showed up with very high
 disk load (as is the case when doing imports)
 

We're running freebsd:

RaidController  3ware 9500S-8
Corrupt unit: Raid-10 3725.27GB 256K Stripe Size without BBU
Freebsd 7.2, UFS Filesystem.



 /Sven
 
 2010/12/2 Robert Gründler rob...@dubture.com:
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?
 
 no, i've checked that already. all disks have plenty of space (they have
 a capacity of 2TB, and are currently filled up to 20%.
 
 
 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?
 
 
 index size is not a problem in our case. Our index currently has about 3GB.
 
 What do you mean with optimizing as you add items to your index?
 
 But I've never heard of Solr causing hard disk crashes,
 
 neither did we, and google is the same opinion.
 
 One thing that i've found is the mergeFactor value:
 
 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
 
 Our sysadmin speculates that maybe the chunk size of our raid/harddisks
 and the segment size of the lucene index does not play well together.
 
 Does the lucene segment size affect how the data is written to the disk?
 
 
 thanks for your help.
 
 
 -robert
 
 
 
 
 
 
 
 
 Best
 Erick
 
 2010/12/2 Robert Gründler rob...@dubture.com
 
 Hi,
 
 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.
 
 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).
 
 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.
 
 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.
 
 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.
 
 Are there any steps we need to take care of in respect to harddisk failures
 when doing a full-import? Right now,
 our steps look like this:
 
 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import
 
 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.
 
 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.
 
 
 thanks.
 
 
 -robert
 
 
 
 
 
 
 
 



Re: Dataimport destroys our harddisks

2010-12-02 Thread Sven Almgren
That's the same series we use... we had problems when running other
disk-heavy operations like rsync and backup on them too.

But in our case we mostly had hangs or load > 180 :P... Can you
simulate very heavy random disk I/O? If so then you could check if you
still have the same problems...

That's all I can be of help with, good luck :)

/Sven

2010/12/2 Robert Gründler rob...@dubture.com:
 On Dec 2, 2010, at 15:43 , Sven Almgren wrote:

 What Raid controller do you use, and what kernel version? (Assuming
 Linux). We hade problems during high load with a 3Ware raid controller
 and the current kernel for Ubuntu 10.04, we hade to downgrade the
 kernel...

 The problem was a bug in the driver that only showed up with very high
 disk load (as is the case when doing imports)


 We're running freebsd:

 RaidController  3ware 9500S-8
 Corrupt unit: Raid-10 3725.27GB 256K Stripe Size without BBU
 Freebsd 7.2, UFS Filesystem.



 /Sven

 2010/12/2 Robert Gründler rob...@dubture.com:
 The very first thing I'd ask is how much free space is on your disk
 when this occurs? Is it possible that you're simply filling up your
 disk?

 no, i've checked that already. all disks have plenty of space (they have
 a capacity of 2TB, and are currently filled up to 20%.


 do note that an optimize may require up to 2X the size of your index
 if/when it occurs. Are you sure you aren't optimizing as you add
 items to your index?


 index size is not a problem in our case. Our index currently has about 3GB.

 What do you mean with optimizing as you add items to your index?

 But I've never heard of Solr causing hard disk crashes,

 neither did we, and google is the same opinion.

 One thing that i've found is the mergeFactor value:

 http://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor

 Our sysadmin speculates that maybe the chunk size of our raid/harddisks
 and the segment size of the lucene index does not play well together.

 Does the lucene segment size affect how the data is written to the disk?


 thanks for your help.


 -robert








 Best
 Erick

 2010/12/2 Robert Gründler rob...@dubture.com

 Hi,

 we have a serious harddisk problem, and it's definitely related to a
 full-import from a relational
 database into a solr index.

 The first time it happened on our development server, where the
 raidcontroller crashed during a full-import
 of ~ 8 Million documents. This happened 2 weeks ago, and in this period 2
 of the harddisks where the solr
 index files are located stopped working (we needed to replace them).

 After the crash of the raid controller, we decided to move the development
 of solr/index related stuff to our
 local development machines.

 Yesterday i was running another full-import of ~10 Million documents on my
 local development machine,
 and during the import, a harddisk failure occurred. Since this failure, my
 harddisk activity seems to
 be around 100% all the time, even if no solr server is running at all.

 I've been googling the last 2 days to find some info about solr related
 harddisk problems, but i didn't find anything
 useful.

 Are there any steps we need to take care of in respect to harddisk 
 failures
 when doing a full-import? Right now,
 our steps look like this:

 1. Delete the current index
 2. Restart solr, to load the updated schemas
 3. Start the full import

 Initially, the solr index and the relational database were located on the
 same harddisk. After the crash, we moved
 the index to a separate harddisk, but nevertheless this harddisk crashed
 too.

 I'd really appreciate any hints on what we might do wrong when importing
 data, as we can't release this
 on our production servers when there's the risk of harddisk failures.


 thanks.


 -robert












RE: Return Lucene DocId in Solr Results

2010-12-02 Thread Lohrenz, Steven
I must be missing something as I'm getting an NPE on the line: docIds[i] =
termDocs.doc();
Here's what I came up with:

private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, List<Favorites> favsBeans)
        throws ParseException {
    // open the core & get data directory
    String indexDir = req.getCore().getIndexDir();

    FSDirectory indexDirectory = null;
    try {
        indexDirectory = FSDirectory.open(new File(indexDir));
    } catch (IOException e) {
        throw new ParseException("IOException, cannot open the index at: " + indexDir + " " + e.getMessage());
    }

    //String pkQueryString = "resourceId:" + favBean.getResourceId();
    //Query pkQuery = new QueryParser(Version.LUCENE_CURRENT, "resourceId",
    //        new StandardAnalyzer()).parse(pkQueryString);

    IndexSearcher searcher = null;
    TopScoreDocCollector collector = null;
    IndexReader indexReader = null;
    TermDocs termDocs = null;

    try {
        searcher = new IndexSearcher(indexDirectory, true);
        indexReader = new FilterIndexReader(searcher.getIndexReader());
        termDocs = indexReader.termDocs();
    } catch (IOException e) {
        throw new ParseException("IOException, cannot open the index at: " + indexDir + " " + e.getMessage());
    }

    int[] docIds = new int[favsBeans.size()];
    int i = 0;
    for (Favorites favBean : favsBeans) {
        Term term = new Term("resourceId", favBean.getResourceId());
        try {
            termDocs.seek(term);
            docIds[i] = termDocs.doc();
        } catch (IOException e) {
            throw new ParseException("IOException, cannot seek to the primary key "
                    + favBean.getResourceId() + " in: " + indexDir + " " + e.getMessage());
        }
        //ScoreDoc[] hits = collector.topDocs().scoreDocs;
        //if(hits != null && hits[0] != null) {

        i++;
        //}
    }

    Arrays.sort(docIds);
    return docIds;
}

Thanks,
Steve
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 02 December 2010 14:20
To: solr-user@lucene.apache.org
Subject: Re: Return Lucene DocId in Solr Results

Ahhh, you're already down in Lucene. That makes things easier...

See TermDocs. Particularly seek(Term). That'll directly access the indexed
unique key rather than having to form a bunch of queries.

Best
Erick


On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 I would be interested in hearing about some ways to improve the algorithm.
 I have done a very straightforward Lucene query within a loop to get the
 docIds.

 Here's what I did to get it working where favsBean are objects returned
 from a query of the second core, but there is probably a better way to do
 it:

 private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, ListFavorites
 favsBeans) throws ParseException {
// open the core  get data directory
String indexDir = req.getCore().getIndexDir();
FSDirectory index = null;
try {
index = FSDirectory.open(new File(indexDir));
} catch (IOException e) {
throw new ParseException(IOException, cannot open the index at:
  + indexDir +   + e.getMessage());
}

int[] docIds = new int[favsBeans.size()];
int i = 0;
for(Favorites favBean: favsBeans) {
String pkQueryString = resourceId: + favBean.getResourceId();
Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
 resourceId, new StandardAnalyzer()).parse(pkQueryString);

IndexSearcher searcher = null;
TopScoreDocCollector collector = null;
try {
searcher = new IndexSearcher(index, true);
collector = TopScoreDocCollector.create(1, true);
searcher.search(pkQuery, collector);
} catch (IOException e) {
throw new ParseException(IOException, cannot search the
 index at:  + indexDir +   + e.getMessage());
}

ScoreDoc[] hits = collector.topDocs().scoreDocs;
if(hits != null  hits[0] != null) {
docIds[i] = hits[0].doc;
i++;
}
}

Arrays.sort(docIds);
return docIds;
 }

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 02 December 2010 13:46
 To: solr-user@lucene.apache.org
 Subject: Re: Return Lucene DocId in Solr Results

 Sounds good, especially because your old scenario was fragile. The doc IDs
 in
 your first core could change as a result of a single doc deletion and
 optimize. So
 the doc IDs stored in the second core would then be wrong...

 Your user-defined unique key is definitely a better way to go. There are
 some tricks
 you could try if there are performance issues

 Best
 Erick

 On Thu, Dec 2, 2010 at 7:47 AM, Lohrenz, 

Multi-valued poly fields search

2010-12-02 Thread Vincent Cautaerts
Hi,

(should this be on solr-dev mailing list?)

I have this kind of data, about articles in newspapers:

article A-001
.  published on 2010-10-31, in newspaper N-1, edition E1
.  published on 2010-10-30, in newspaper N-2, edition E2
article A-002
.  published on 2010-10-30, in newspaper N-1, edition E1

I have to be able to search on those sub-fields, eg:

all articles published on 2010-10-30 in newspaper N-1 (all editions)

I expect to find document A-002, but not document A-001

I control the indexing, analyzers,... but I would like use standard Solr
query syntax (or an extension of it)

If I index those documents:
<add>
  <doc>
    <field name="id">A-001</field>
    <field name="pubDate">2010-10-31</field>
    <field name="ns">N-1</field>
    <field name="ed">E1</field>
    <field name="pubDate">2010-10-30</field>
    <field name="ns">N-2</field>
    <field name="ed">E2</field>
  </doc>
  <doc>
    <field name="id">A-002</field>
    <field name="pubDate">2010-10-30</field>
    <field name="ns">N-1</field>
    <field name="ed">E1</field>
  </doc>
</add>

(ie: flattening the structure, losing the link between newspapers and dates)
then a search for pubDate=2010-10-30 AND ns=N-1 will give me both
documents (because A-001 has been published in newspaper N-1 (at another
date) and has been published on 2010-10-30 (but in another newspaper))

Is there any way to index the data/express the search/... to be able to find
only document A-002?

In Solr terms, I believe that this is a multi-valued poly field (not yet
in the current stable version 1.4...)

Will this be supported by the next release? (what syntax?)

Some idea that I've had (usable with Solr 1.4)

(1)
Add fields like this for doc A-001:
 <field name="combined">N-1/E1/2010-10-31</field>
 <field name="combined">N-2/E2/2010-10-30</field>
and make a wildcard search N-1/*/2010-10-30


this will work for simple queries, but:
. I think that it will not allow range queries: all articles published in
newspaper N-1 between 2009-08-01 and 2010-10-15
. a wildcard query on N-1/E2/* will be very inefficient!
. writing queries will be more difficult (sometimes the user has to use the
field ns, sometimes the field combined,...)


(2)
Make the simple query pubDate=2010-10-30 AND ns=N-1, but filter the
results (the above query will give all correct results, plus some more).
This is not a generic solution, and writing the filter will be difficult if
the query is more complex:
(pubDate=2010-10-31 AND ns=N-1 ) OR (text contains Barack)

(3)
On the same field as (1) above, use an analyzer that will cheat the
proximity search by issuing the following terms:

term 1: ns:N-1
term 2: ed:E1
term 3: pubDate:2010-10-31
term 11: ns:N-2
term 12: ed:E2
term 13: pubDate:2010-10-30
...

then a proximity search

(combined:ns:N-1 AND combined:pubDate:2010-10-30)~3

will give me only document A-002, not document A-001

Again, this will make problems with range queries, won't it?

Isn't there any better way to do this?

Ideally, I would index this (with my own syntax...):

<doc>
  <field name="id">A-001</field>
  <field name="pubDate" set="1">2010-10-31</field>
  <field name="ns" set="1">N-1</field>
  <field name="ed" set="1">E1</field>
  <field name="pubDate" set="2">2010-10-30</field>
  <field name="ns" set="2">N-2</field>
  <field name="ed" set="2">E2</field>
</doc>

 and then search:

(pubDate=2010-10-31 AND ns=N-1){sameSet}

or something like this...

I've found references to similar questions, but no answer that I could use
in my case.
(this one being the closer to my problem:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201001.mbox/%3c9b742a34aa31814594f2bed8dfd9cceec96ca...@sparky.office.techtarget.com%3e

or http://tinyurl.com/3527w4u)

Thanks in advance for your ideas!
(and sorry for any english mistakes)


Re: SOLR Thesaurus

2010-12-02 Thread Michael Zach
Hello Lee,

these bells sound like SKOS ;o)

AFAIK Solr does not support thesauri, just plain flat synonym lists.

One could implement a thesaurus filter and put it at the end of the analyzer
chain of Solr.

The filter would then do a thesaurus lookup for each token it receives and
possibly
* expand the query
or
* kind of stem document tokens to some preferred variants according to the
thesaurus

Maybe even take term relations from the thesaurus into account and boost queries
or doc fields at index time.
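
As a rough sketch (not an existing Solr filter; the lookup table below is just
a stand-in for a real thesaurus service), such a filter could look like this
for the "stem to preferred variant" case:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Replaces each token with its preferred term, if the thesaurus knows one.
// Expansion to broader/narrower/related terms would additionally emit extra
// tokens at the same position (position increment 0).
public final class PreferredTermFilter extends TokenFilter {
    private final Map<String, String> preferredTerms;  // e.g. "ski" -> "skiing"
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public PreferredTermFilter(TokenStream input, Map<String, String> preferredTerms) {
        super(input);
        this.preferredTerms = preferredTerms;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String preferred = preferredTerms.get(termAtt.term());
        if (preferred != null) {
            termAtt.setTermBuffer(preferred);
        }
        return true;
    }
}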

Maybe have a look at http://poolparty.punkt.at/ a full-featured SKOS thesaurus 
management server.
It also provides webservices which could feed such a Solr filter.

Kind regards
Michael


- Ursprüngliche Mail -
Von: lee carroll lee.a.carr...@googlemail.com
An: solr-user@lucene.apache.org
Gesendet: Donnerstag, 2. Dezember 2010 09:55:54
Betreff: SOLR Thesaurus

Hi List,

Coming to and end of a proto type evaluation of SOLR (all very good etc etc)
Getting to the point at looking at bells and whistles. Does SOLR have a
thesuarus. Cant find any refrerence
to one in the docs and on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesuarus)

I mean something like:

PT: 
BT: xxx,,
NT: xxx,,
RT:xxx,xxx,xxx
Scope Note: xx,

Like i say bells and whistles

cheers Lee


Re: SOLR Thesaurus

2010-12-02 Thread lee carroll
Hi

Stephen, yes, sorry, I should have been more plain

a term can have a Preferred Term (PT), many Broader Terms (BT), many Narrower
Terms (NT), Related Terms (RT), etc.

So

User-supplied term is, say: Ski

Preferred term: Skiing
Broader terms could be: Ski and Snow Boarding, Mountain Sports, Sports
Narrower terms: downhill skiing, telemark, cross country
Related terms: boarding, snow boarding, winter holidays

Michael,

yes exactly, SKOS, although maybe without the overweening ambition to take
over the world.

By the sounds of it, though, out of the box you get a simple (but pretty
effective) synonym list and ring. Anything more we'd need to write
ourselves, i.e. your thesaurus filter, plus a change to the response, as
broader terms, narrower terms etc. would be good to suggest to the UI.

No plugins out there?

On 2 December 2010 16:16, Michael Zach za...@punkt.at wrote:

 Hello Lee,

 these bells sound like SKOS ;o)

 AFAIK Solr does not support thesauri just plain flat synonym lists.

 One could implement a thesaurus filter and put it into the end of the
 analyzer chain of solr.

 The filter would then do a thesaurus lookup for each token it receives and
 possibly
 * expand the query
 or
 * kind of stem document tokens to some prefered variants according to the
 thesaurus

 Maybe even taking term relations from thesaurus into account and boost
 queries or doc fields at index time.

 Maybe have a look at http://poolparty.punkt.at/ a full features SKOS
 thesaurus management server.
 It's also providing webservices which could feed such a Solr filter.

 Kind regards
 Michael


 - Original Message -
 From: lee carroll lee.a.carr...@googlemail.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, 2 December 2010 09:55:54
 Subject: SOLR Thesaurus

 Hi List,

 Coming to and end of a proto type evaluation of SOLR (all very good etc
 etc)
 Getting to the point at looking at bells and whistles. Does SOLR have a
 thesuarus. Cant find any refrerence
 to one in the docs and on the wiki etc. (Apart from a few mail threads
 which
 describe the synonym.txt as a thesuarus)

 I mean something like:

 PT: 
 BT: xxx,,
 NT: xxx,,
 RT:xxx,xxx,xxx
 Scope Note: xx,

 Like i say bells and whistles

 cheers Lee



Import Data Into Solr

2010-12-02 Thread Bing Li
Hi, all,

I am a new user of Solr. Before using it, all of my data was indexed by myself
with Lucene. According to Chapter 3 of the book Solr 1.4 Enterprise
Search Server, written by David Smiley and Eric Pugh, data in the formats of
XML, CSV and even PDF, etc., can be imported into Solr.

If I wish to import the existing Lucene indexes into Solr, are there any other
approaches? I know that Solr is a serverized Lucene.

Thanks,
Bing Li


Re: Dynamically change master

2010-12-02 Thread Tommaso Teofili
Back with my master resiliency need: talking with Upayavira we discovered we
were proposing the same solution :-)
This can be useful if you don't have a VIP with a master/backup polling
policy.
It goes like this: there are 2 hosts for indexing, one main and one backup;
the backup one is a slave of the main one, and the main one is
also the master of N hosts which will be used for searching. If the main master
goes down then the backup one will be used for indexing and/or serving
search slaves.
This last feature can be done by defining an external properties file for each
search slave which will contain the URL of the master (pointed to inside the
replication request handler tag of solrconfig.xml), so if these search
slaves run on multi-core one only has to change the properties file URL and
issue a http://SLAVEURL/solr/admin/cores?action=RELOAD&core=core0 to get
the slave polling the backup master.
Cheers,
Tommaso
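To make the last step concrete, here is a minimal, hedged sketch of firing that RELOAD call at each search slave once its properties file has been switched to the backup master (host names, port and core name are invented; plain java.net is used rather than SolrJ):

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

// Sketch only: reload one core on each slave so the replication handler re-reads
// the external properties file (and therefore the new master URL).
public class ReloadSlaves {
  public static void main(String[] args) throws Exception {
    String[] slaves = { "http://slave1:8983/solr", "http://slave2:8983/solr" };
    String core = "core0";
    for (String slave : slaves) {
      URL cmd = new URL(slave + "/admin/cores?action=RELOAD&core="
          + URLEncoder.encode(core, "UTF-8"));
      InputStream in = cmd.openStream();           // Solr answers with a small XML status message
      in.close();
      System.out.println("Reloaded " + core + " on " + slave);
    }
  }
}

How the properties file itself gets updated (scp, configuration management, etc.) is left out here.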



2010/12/1 Tommaso Teofili tommaso.teof...@gmail.com

 Thanks Upayavira, that sounds very good.

 p.s.:
 I read that page some weeks ago and didn't get back to check on it.


 2010/12/1 Upayavira u...@odoko.co.uk

 Note, all extracted from http://wiki.apache.org/solr/SolrReplication

 You'd put:

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <!-- Replicate on 'startup' and 'commit'. 'optimize' is also a
          valid value for replicateAfter. -->
     <str name="replicateAfter">startup</str>
     <str name="replicateAfter">commit</str>
   </lst>
 </requestHandler>

 into every box you want to be able to act as a master, then use:

 http://slave_host:port/solr/replication?command=fetchindex&masterUrl=<your master URL>

 As the above page says better than I can, It is possible to pass on
 extra attribute 'masterUrl' or other attributes like 'compression' (or
 any other parameter which is specified in the <lst name="slave"> tag) to
 do a one time replication from a master. This obviates the need for
 hardcoding the master in the slave.

 HTH, Upayavira

 On Wed, 01 Dec 2010 06:24 +0100, Tommaso Teofili
 tommaso.teof...@gmail.com wrote:
  Hi Upayavira,
  this is a good start for solving my problem, can you please tell how
 does
  such a replication URL look like?
  Thanks,
  Tommaso
 
  2010/12/1 Upayavira u...@odoko.co.uk
 
   Hi Tommaso,
  
   I believe you can tell each server to act as a master (which means it
   can have its indexes pulled from it).
  
   You can then include the master hostname in the URL that triggers a
   replication process. Thus, if you triggered replication from outside
   solr, you'd have control over which master you pull from.
  
   Does this answer your question?
  
   Upayavira
  
  
   On Tue, 30 Nov 2010 09:18 -0800, Ken Krugler
   kkrugler_li...@transpac.com wrote:
Hi Tommaso,
   
On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:
   
 Hi all,

 in a replication environment if the host where the master is
 running
 goes
 down for some reason, is there a way to communicate to the slaves
 to
 point
 to a different (backup) master without manually changing
 configuration (and
 restarting the slaves or their cores)?

 Basically I'd like to be able to change the replication master
 dinamically
 inside the slaves.

 Do you have any idea of how this could be achieved?
   
One common approach is to use VIP (virtual IP) support provided by
load balancers.
   
Your slaves are configured to use a VIP to talk to the master, so
 that
it's easy to dynamically change which master they use, via updates
 to
the load balancer config.
   
-- Ken
   
--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
   
   
   
   
   
   
  
 





TermsComponent prefix query with fields analyzers

2010-12-02 Thread Nestor Oviedo
Hi everyone
Does anyone know how to apply some analyzers over a prefix query?
What I'm looking for is a way to build an autosuggest using the
termsComponent that could be able to remove the accents from the
query's prefix.
For example, I have the term analisis in the index and I want to
retrieve it with the prefix Análi (notice the accent in the third
letter).
I think the regexp function won't help here, so I was wondering whether,
by specifying some analyzers (LowerCase and ASCIIFolding) in the
TermsComponent configuration, they would be applied to the prefix.

Thanks in advance.
Nestor


disabled replication setting

2010-12-02 Thread Xin Li
For Solr replication, we can send a command to disable replication. Does
anyone know where I can verify the replication enabled/disabled
setting? I cannot seem to find it in the dashboard or in the details command
output.

Thanks,

Xin


Exceptions in Embedded Solr

2010-12-02 Thread Tharindu Mathew
Hi everyone,

I suddenly get the exception below when using Embedded Solr. If I
delete the Solr index it goes back to normal, but it obviously has to
start indexing from scratch. Any idea what the cause of this is?

java.lang.RuntimeException: java.io.FileNotFoundException:
/home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
(No such file or directory)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at org.apache.solr.core.SolrCore.init(SolrCore.java:579)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at org.wso2.carbon.registry.indexing.solr.SolrClient.init(SolrClient.java:103)
at 
org.wso2.carbon.registry.indexing.solr.SolrClient.getInstance(SolrClient.java:115)
... 44 more
Caused by: java.io.FileNotFoundException:
/home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
(No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.init(RandomAccessFile.java:212)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.init(SimpleFSDirectory.java:78)
at 
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.init(SimpleFSDirectory.java:108)
at 
org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.init(NIOFSDirectory.java:94)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:70)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:691)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:236)
at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
at 
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
... 48 more

[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@58f24b6
(null) has a reference count of 1
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@654dbbf6
(null) has a reference count of 1
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
CoreContainer was not shutdown prior to finalize(), indicates a bug --
POSSIBLE RESOURCE LEAK!!!
[2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
CoreContainer was not shutdown prior to finalize(), indicates a bug --
POSSIBLE RESOURCE LEAK!!!

-- 
Regards,

Tharindu



-- 
Regards,

Tharindu


RE: disabled replication setting

2010-12-02 Thread Xin Li
Does anyone know?

Thanks,

-Original Message-
From: Xin Li [mailto:xin.li@gmail.com] 
Sent: Thursday, December 02, 2010 12:25 PM
To: solr-user@lucene.apache.org
Subject: disabled replication setting

For solr replication, we can send command to disable replication. Does
anyone know where i can verify the replication enabled/disabled
setting? i cannot seem to find it on dashboard or details command
output.

Thanks,

Xin



Re: Return Lucene DocId in Solr Results

2010-12-02 Thread Erick Erickson
You have to call termDocs.next() after termDocs.seek(). Something like:

termDocs.seek(term);
if (termDocs.next()) {
    // means there was a matching term/doc and your references should be valid.
}
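Putting that together, a small self-contained sketch of the lookup (the field name resourceId is taken from the code below; TermDocs is the Lucene 2.9/3.x API that Solr 1.4 exposes):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch only: map a unique-key value to its internal Lucene doc id, or -1 if absent.
public class PrimaryKeyLookup {
  public static int docIdFor(IndexReader reader, String resourceId) throws IOException {
    TermDocs termDocs = reader.termDocs();
    try {
      termDocs.seek(new Term("resourceId", resourceId));
      // next() must return true before doc() is valid.
      return termDocs.next() ? termDocs.doc() : -1;
    } finally {
      termDocs.close();
    }
  }
}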

On Thu, Dec 2, 2010 at 10:22 AM, Lohrenz, Steven
steven.lohr...@hmhpub.comwrote:

 I must be missing something as I'm getting a NPE on the line: docIds[i] =
 termDocs.doc();
 here's what I came up with:

 private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req, ListFavorites
 favsBeans) throws ParseException {
// open the core  get data directory
String indexDir = req.getCore().getIndexDir();

 FSDirectory indexDirectory = null;
try {
indexDirectory = FSDirectory.open(new File(indexDir));
 } catch (IOException e) {
throw new ParseException(IOException, cannot open the index at:
  + indexDir +   + e.getMessage());
}

 //String pkQueryString = resourceId: + favBean.getResourceId();
 //Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
 resourceId, new StandardAnalyzer()).parse(pkQueryString);

IndexSearcher searcher = null;
TopScoreDocCollector collector = null;
 IndexReader indexReader = null;
TermDocs termDocs = null;

try {
searcher = new IndexSearcher(indexDirectory, true);
indexReader = new FilterIndexReader(searcher.getIndexReader());
termDocs = indexReader.termDocs();
 } catch (IOException e) {
throw new ParseException(IOException, cannot open the index at:
  + indexDir +   + e.getMessage());
}

int[] docIds = new int[favsBeans.size()];
int i = 0;
for(Favorites favBean: favsBeans) {
 Term term = new Term(resourceId, favBean.getResourceId());
try {
termDocs.seek(term);
docIds[i] = termDocs.doc();
} catch (IOException e) {
throw new ParseException(IOException, cannot seek to the
 primary key  + favBean.getResourceId() +  in :  + indexDir +   +
 e.getMessage());
 }
//ScoreDoc[] hits = collector.topDocs().scoreDocs;
//if(hits != null  hits[0] != null) {

 i++;
//}
}

Arrays.sort(docIds);
return docIds;
}

 Thanks,
 Steve
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: 02 December 2010 14:20
 To: solr-user@lucene.apache.org
 Subject: Re: Return Lucene DocId in Solr Results

 Ahhh, you're already down in Lucene. That makes things easier...

 See TermDocs. Particularly seek(Term). That'll directly access the indexed
 unique key rather than having to form a bunch of queries.

 Best
 Erick


 On Thu, Dec 2, 2010 at 8:59 AM, Lohrenz, Steven
 steven.lohr...@hmhpub.comwrote:

  I would be interested in hearing about some ways to improve the
 algorithm.
  I have done a very straightforward Lucene query within a loop to get the
  docIds.
 
  Here's what I did to get it working where favsBean are objects returned
  from a query of the second core, but there is probably a better way to do
  it:
 
  private int[] getDocIdsFromPrimaryKey(SolrQueryRequest req,
 ListFavorites
  favsBeans) throws ParseException {
 // open the core  get data directory
 String indexDir = req.getCore().getIndexDir();
 FSDirectory index = null;
 try {
 index = FSDirectory.open(new File(indexDir));
 } catch (IOException e) {
 throw new ParseException(IOException, cannot open the index
 at:
   + indexDir +   + e.getMessage());
 }
 
 int[] docIds = new int[favsBeans.size()];
 int i = 0;
 for(Favorites favBean: favsBeans) {
 String pkQueryString = resourceId: +
 favBean.getResourceId();
 Query pkQuery = new QueryParser(Version.LUCENE_CURRENT,
  resourceId, new StandardAnalyzer()).parse(pkQueryString);
 
 IndexSearcher searcher = null;
 TopScoreDocCollector collector = null;
 try {
 searcher = new IndexSearcher(index, true);
 collector = TopScoreDocCollector.create(1, true);
 searcher.search(pkQuery, collector);
 } catch (IOException e) {
 throw new ParseException(IOException, cannot search the
  index at:  + indexDir +   + e.getMessage());
 }
 
 ScoreDoc[] hits = collector.topDocs().scoreDocs;
 if(hits != null  hits[0] != null) {
 docIds[i] = hits[0].doc;
 i++;
 }
 }
 
 Arrays.sort(docIds);
 return docIds;
  }
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: 02 December 2010 13:46
  To: solr-user@lucene.apache.org
  Subject: Re: Return Lucene DocId in Solr Results
 
  Sounds good, especially because your old scenario was fragile. The doc
 IDs
  in
  

Re: Import Data Into Solr

2010-12-02 Thread Erick Erickson
You can just point your Solr instance at your Lucene index; really, copy the
Lucene index into the right place to be found by Solr.

HOWEVER, you need to take great care that the field definitions that you
used
when you built your Lucene index are compatible with the ones configured in
your
schema.xml file. This is NOT a trivial task.

I'd recommend that you try having Solr build your index; you'll probably
want to sometime in the future anyway, so you might as well bite the bullet now if
possible...

Plus, I'm not quite sure about index version issues.

Best
Erick

On Thu, Dec 2, 2010 at 11:54 AM, Bing Li lbl...@gmail.com wrote:

 Hi, all,

 I am a new user of Solr. Before using it, all of the data is indexed myself
 with Lucene. According to the Chapter 3 of the book, Solr. 1.4 Enterprise
 Search Server written by David Smiley and Eric Pugh, data in the formats of
 XML, CSV and even PDF, etc, can be imported to Solr.

 If I wish to import the Lucene indexes into Solr, may I have any other
 approaches? I know that Solr is a serverized Lucene.

 Thanks,
 Bing Li



Re: SOLR Thesaurus

2010-12-02 Thread Jonathan Rochkind
No, it doesn't.  And it's not entirely clear what (if any) simple way 
there is to use Solr to expose hierarchically related documents in a 
way that preserves and usefully allows navigation of the relationships.  
At least in general, for sophisticated stuff.


On 12/2/2010 3:55 AM, lee carroll wrote:

Hi List,

Coming to and end of a proto type evaluation of SOLR (all very good etc etc)
Getting to the point at looking at bells and whistles. Does SOLR have a
thesuarus. Cant find any refrerence
to one in the docs and on the wiki etc. (Apart from a few mail threads which
describe the synonym.txt as a thesuarus)

I mean something like:

PT: 
BT: xxx,,
NT: xxx,,
RT:xxx,xxx,xxx
Scope Note: xx,

Like i say bells and whistles

cheers Lee



Re: TermsComponent prefix query with fields analyzers

2010-12-02 Thread Jonathan Rochkind
I don't believe you can.  If you just need query-time transformation, 
can't you just do it in your client app? If you need index-time 
transformation... well, you can do that, but it's up to your schema.xml 
and will of course apply to the field as a whole, not just for 
TermsComponent queries, because that's just how Solr works.


I'd note for your example, you'll also have to lowercase that capital A 
if you want it to match a lowercased a in a termscomponent prefix query.


To my mind (others may disagree), robust flexible auto-complete like 
this is still a somewhat unsolved problem in Solr; the TermsComponent 
approach has its definite limitations.
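As a hedged sketch of that client-side transformation (JDK only, nothing Solr-specific; class and method names are invented), lowercasing and stripping accents from the prefix before it is sent could look like:

import java.text.Normalizer;

// Sketch only: fold the user's prefix before sending it as the terms.prefix parameter.
public class PrefixFolder {

  public static String fold(String prefix) {
    String lower = prefix.toLowerCase();
    // Decompose characters ('á' becomes 'a' plus a combining accent), then drop the accents.
    String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
    return decomposed.replaceAll("\\p{M}", "");
  }

  public static void main(String[] args) {
    System.out.println(fold("Análi"));   // prints "anali"
  }
}

The folded string would then match the unaccented, lowercased terms in the index, assuming the indexed field was folded the same way.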


On 12/2/2010 12:24 PM, Nestor Oviedo wrote:

Hi everyone
Does anyone know how to apply some analyzers over a prefix query?
What I'm looking for is a way to build an autosuggest using the
termsComponent that could be able to remove the accents from the
query's prefix.
For example, I have the term analisis in the index and I want to
retrieve it with the prefix Análi (notice the accent in the third
letter).
I think the regexp function won't help here, so I was wondering if
specifying some analyzers (LowerCase and ASCIIFolding) in the
termComponents configuration, it would be applied over the prefix.

Thanks in advance.
Nestor



RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
Hi Mike,

We turned on infostream.   Is there documentation about how to interpret it, or 
should I just grep through the codebase?

Is the excerpt below what I am looking for as far as understanding the 
relationship between ramBufferSize and size on disk?
is newFlushedSize the size on disk in bytes?


DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%

RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 
byteBlockFre
e=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]:   flush: segment=_5h docStoreSegment=_5e 
docStoreOffset=266 flushDocs=true flushDeletes=false 
flushDocStores=false numDocs=40 numBufDelTerms=40
... Dec 1, 2010 5:40:22 PM   purge field=geographic
Dec 1, 2010 5:40:22 PM   purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: DW:   ramUsed=325.772 MB newFlushedSize=69848046 
docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, 
_5h.fnm, _5h.tii]



Tom


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom tburt...@umich.edu wrote:
 Thanks Mike,

 Yes we have many unique terms due to dirty OCR and 400 languages and probably 
 lots of low doc freq terms as well (although with the ICUTokenizer and 
 ICUFoldingFilter we should get fewer terms due to bad tokenization and 
 normalization.)

OK likely this explains the lowish RAM efficiency.

 Is this additional overhead because each unique term takes a certain amount 
 of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish startup cost for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...

 Does turning on IndexWriters infostream have a significant impact on memory 
 use or indexing speed?

I don't believe so

Mike


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Yonik Seeley
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey s...@elyograg.org wrote:
 I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not
 segment, but all the other files do.  I can't remember whether it behaves
 the same under 3.1, or whether it also creates these files in each segment.

Yep, that's the shared doc store (where stored fields go.. the
non-inverted part of the index), and it works like that in 3.x and
trunk too.
It's nice because when you merge segments, you don't have to re-copy
the docs (provided you're within a single indexing session).
There have been discussions about removing it in trunk though... we'll see.

-Yonik
http://www.lucidimagination.com


Joining Fields in an Index

2010-12-02 Thread Adam Estrada
All,

I have an index that has a field with country codes in it. I have 7 million or 
so documents in the index and when displaying facets the country codes don't 
mean a whole lot to me. Is there any way to add a field with the full country 
names then join the codes in there accordingly? I suppose I can do this before 
updating the records in the index but before I do that I would like to know if 
there is a way to do this sort of join. 

Example: US - United States

Thanks,
Adam

Re: Joining Fields in an Index

2010-12-02 Thread Savvas-Andreas Moysidis
Hi,

If you are able to do a full re-index then you could index the full names
and not the codes. When you later facet on the Country field you'll get the
actual name rather than the code.
If you are not able to re-index then this conversion could probably be added
at your application layer prior to displaying your results (e.g. in your DAO
object).
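As a hedged sketch of that application-layer mapping (JDK only; class name invented), an ISO 3166 country code from a facet value can be turned into a display name like this:

import java.util.Locale;

// Sketch only: resolve an ISO 3166 alpha-2 code (as stored in the index) to a readable name.
public class CountryNames {

  public static String displayName(String isoCode) {
    // new Locale(language, country): an empty language is fine for a country-only lookup.
    return new Locale("", isoCode).getDisplayCountry(Locale.ENGLISH);
  }

  public static void main(String[] args) {
    System.out.println("US -> " + displayName("US"));   // US -> United States
  }
}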

On 2 December 2010 22:05, Adam Estrada estrada.adam.gro...@gmail.comwrote:

 All,

 I have an index that has a field with country codes in it. I have 7 million
 or so documents in the index and when displaying facets the country codes
 don't mean a whole lot to me. Is there any way to add a field with the full
 country names then join the codes in there accordingly? I suppose I can do
 this before updating the records in the index but before I do that I would
 like to know if there is a way to do this sort of join.

 Example: US - United States

 Thanks,
 Adam


Re: Joining Fields in an Index

2010-12-02 Thread Adam Estrada
Hi,

I was hoping to do it directly in the index but it was more out of curiosity 
than anything. I can certainly map it in the DAO but again...I was hoping to 
learn if it was possible in the index.

Thanks for the feedback!

Adam

On Dec 2, 2010, at 5:48 PM, Savvas-Andreas Moysidis wrote:

 Hi,
 
 If you are able to do a full re-index then you could index the full names
 and not the codes. When you later facet on the Country field you'll get the
 actual name rather than the code.
 If you are not able to re-index then probably this conversion could be added
 at your application layer prior to displaying your results.(e.g. in your DAO
 object)
 
 On 2 December 2010 22:05, Adam Estrada estrada.adam.gro...@gmail.comwrote:
 
 All,
 
 I have an index that has a field with country codes in it. I have 7 million
 or so documents in the index and when displaying facets the country codes
 don't mean a whole lot to me. Is there any way to add a field with the full
 country names then join the codes in there accordingly? I suppose I can do
 this before updating the records in the index but before I do that I would
 like to know if there is a way to do this sort of join.
 
 Example: US - United States
 
 Thanks,
 Adam



Re: spatial query parinsg error: org.apache.lucene.queryParser.ParseException

2010-12-02 Thread Dennis Gearon
It WORKED

Thank you so much everybody!

I feel like jumping up and down like 'Hiro' on Heroes

 Dennis Gearon
- Original Message - From: Dennis Gearon gear...@sbcglobal.net

To: solr-user@lucene.apache.org
Sent: Wednesday, December 01, 2010 7:51 PM
Subject: spatial query parinsg error: 
org.apache.lucene.queryParser.ParseException


I am trying to get spatial search to work on my Solr installation. I am running
version 1.4.1 with the Jayway Team spatial-solr-plugin. I am performing the
search with the following url:

http://localhost:8080/solr/select?wt=jsonindent=trueq=title:Art%20Loft{!spatial%20lat=37.326375%20lng=-121.892639%20radius=3%20unit=km%20threadCount=3}



The result that I get is the following error:

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse
'title:Art Loft{!spatial lat=37.326375 lng=-121.892639 radius=3 unit=km
threadCount=3}': Encountered  RANGEEX_GOOP lng=-121.892639  at line 1,
column 38. Was expecting: }

Not sure why it would be complaining about the lng parameter in the query. I
double-checked to make sure that I had the right name for the longitude field in
my solrconfig.xml file.

Any help/suggestions would be greatly appreciated

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.


Cannot start Solr anymore

2010-12-02 Thread Ruixiang Zhang
Hi, I'm new here.

First, could anyone tell me how to restart Solr?

I started Solr and killed the process. Then when I tried to start it again,
it failed:

$ java -jar start.jar
2010-12-02 14:28:00.011::INFO:  Logging to STDERR via
org.mortbay.log.StdErrLog
2010-12-02 14:28:00.099::INFO:  jetty-6.1.3
2010-12-02 14:28:00.231::WARN:  Failed startup of context
org.mortbay.jetty.webapp.webappcont...@73901437
{/solr,jar:file:/.../solr/apache-solr-1.4.1/example/webapps/solr.war!/}
java.util.zip.ZipException: invalid END header (bad central directory
offset)


Thanks
Richard


Re: TermsComponent prefix query with fields analyzers

2010-12-02 Thread Ahmet Arslan
 Does anyone know how to apply some analyzers over a prefix
 query?

Lucene has a special QueryParser for this.

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

Someone provided a patch to use it in Solr. It was an attachment to a thread at 
Nabble. I can't find it now.

Similar discussion : http://search-lucene.com/m/oMtRJQPgGb1/


  


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Koji Sekiguchi

(10/12/03 8:58), Ruixiang Zhang wrote:

I tried to import data from mysql. When I tried to run
http://mydomain.com:8983/solr/admin/dataimport , I got these error message:

HTTP ERROR: 404

NOT_FOUND

RequestURI=/solr/admin/dataimport

Powered by Jetty:// http://jetty.mortbay.org/

Any help will be appreciated!!!
Thanks
Richard


Richard,

Usually, it should be http://mydomain.com:8983/solr/dataimport

Koji
--
http://www.rondhuit.com/en/


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Ruixiang Zhang
Hi Koji

Thanks for your reply.
I pasted the wrong link.
Actually I tried this first: http://mydomain.com:8983/solr/dataimport
It didn't work.
The page should be there after installation, right? Did I miss something?

Thanks a lot!
Richard





On Thu, Dec 2, 2010 at 4:23 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/12/03 8:58), Ruixiang Zhang wrote:

 I tried to import data from mysql. When I tried to run
 http://mydomain.com:8983/solr/admin/dataimport , I got these error
 message:

 HTTP ERROR: 404

 NOT_FOUND

 RequestURI=/solr/admin/dataimport

 Powered by Jetty:// http://jetty.mortbay.org/

 Any help will be appreciated!!!
 Thanks
 Richard


 Richard,

 Usually, it should be http://mydomain.com:8983/solr/dataimport

 Koji
 --
 http://www.rondhuit.com/en/



Re: solr/admin/dataimport Not Found

2010-12-02 Thread Koji Sekiguchi

(10/12/03 9:29), Ruixiang Zhang wrote:

Hi Koji

Thanks for your reply.
I pasted the wrong link.
Actually I tried this fist http://mydomain.com:8983/solr/dataimport
It didn't work.
The page should be there after installation, right? Did I miss something?

Thanks a lot!
Richard


To make that URL work, you have to have a request handler in your solrconfig.xml:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

If you try DIH for the first time, please read 
solr/example/example-DIH/README.txt
and try example-DIH first.

Koji
--
http://www.rondhuit.com/en/


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:31 PM, Burton-West, Tom tburt...@umich.edu wrote:

 We turned on infostream.   Is there documentation about how to interpret it, 
 or should I just grep through the codebase?

There isn't any documentation... and it changes over time as we add
new diagnostics.

 Is the excerpt below what I am looking for as far as understanding the 
 relationship between ramBufferSize and size on disk?
 is newFlushedSize the size on disk in bytes?

Yes -- so IW's buffer was using 329.782 MB RAM, and was flushed to a
69,848,046 byte segment.

Mike
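For instance, reading the second flush in the excerpt above: newFlushedSize=69848046 bytes is roughly 66.6 MB on disk, i.e. about 20% of the 325.772 MB RAM buffer, which is exactly where the infostream's new/old=20.447% figure comes from.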


Re: solr/admin/dataimport Not Found

2010-12-02 Thread Ruixiang Zhang
Thank you so much, Koji, the example-DIH works. I'm reading for details...

Richard


On Thu, Dec 2, 2010 at 4:39 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (10/12/03 9:29), Ruixiang Zhang wrote:

 Hi Koji

 Thanks for your reply.
 I pasted the wrong link.
 Actually I tried this fist http://mydomain.com:8983/solr/dataimport
 It didn't work.
 The page should be there after installation, right? Did I miss something?

 Thanks a lot!
 Richard


 To work that URL, you have to have a request handler in your
 solrconfig.xml:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

 If you try DIH for the first time, please read
 solr/example/example-DIH/README.txt
 and try example-DIH first.


 Koji
 --
 http://www.rondhuit.com/en/



Limit number of characters returned

2010-12-02 Thread Mark

Is there a way to limit the number of characters returned from a stored field?

For example:

Say I have a document (~2K words) and I search for a word that's 
somewhere in the middle. I would like the document to match the search 
query but the stored field should only return the first 200 characters 
of the document. Is there any way to accomplish this that doesn't involve 
two fields?


Thanks


PDF text extracted without spaces

2010-12-02 Thread Ganesh
Hello all,

I know this is not the right group to ask this question, but I thought some of you 
might have experienced it.

I am a newbie with Tika, using the latest version, 0.8. I extracted text 
from a PDF document but found spaces and newlines missing. Indexing the data 
gives wrong results. Could anyone in this group help me? I am using Tika 
directly to extract the contents, which later get indexed.

Regards
Ganesh


Solr Multi-thread Update Transaction Control

2010-12-02 Thread wangjb

Hi,
 Now we are using Solr 1.4.1 and have encountered a problem.
 When multiple threads update Solr data at the same time, can every thread 
have its own separate transaction?

 If this is possible, how can we achieve it?
 Is there any suggestion here?
 Waiting online.
 Thank you for any useful reply.





Query performance very slow even after autowarming

2010-12-02 Thread johnnyisrael

Hi,

I am using EdgeNGramFilterFactory on Solr 1.4.1
(<filter class="solr.EdgeNGramFilterFactory" maxGramSize="100" minGramSize="1"/>)
for my indexing.

Each document will have about 5 fields in it and only one field is indexed
with EdgeNGramFilterFactory.

I have about 1.4 million documents in my index now and my index size is
approx 296MB.

I made the field that is indexed with EdgeNGramFilterFactory the default
search field. All my query responses are very slow, some of them taking more
than 10 seconds to respond; even queries with single letters are very slow.

/select/?q=m

So I tried query warming as follows.

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">a</str></lst>
    <lst><str name="q">b</str></lst>
    <lst><str name="q">c</str></lst>
    <lst><str name="q">d</str></lst>
    <lst><str name="q">e</str></lst>
    <lst><str name="q">f</str></lst>
    <lst><str name="q">g</str></lst>
    <lst><str name="q">h</str></lst>
    <lst><str name="q">i</str></lst>
    <lst><str name="q">j</str></lst>
    <lst><str name="q">k</str></lst>
    <lst><str name="q">l</str></lst>
    <lst><str name="q">m</str></lst>
    <lst><str name="q">n</str></lst>
    <lst><str name="q">o</str></lst>
    <lst><str name="q">p</str></lst>
    <lst><str name="q">q</str></lst>
    <lst><str name="q">r</str></lst>
    <lst><str name="q">s</str></lst>
    <lst><str name="q">t</str></lst>
    <lst><str name="q">u</str></lst>
    <lst><str name="q">v</str></lst>
    <lst><str name="q">w</str></lst>
    <lst><str name="q">x</str></lst>
    <lst><str name="q">y</str></lst>
    <lst><str name="q">z</str></lst>
  </arr>
</listener>

The same above is done for firstSearcher as well.

My cache settings are as follows.

<filterCache
  class="solr.LRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="4096"/>

<queryResultCache
  class="solr.LRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="1024"/>

<documentCache
  class="solr.LRUCache"
  size="16384"
  initialSize="16384"/>

Still, after query warming, a few single-character searches are taking up to 3
seconds to respond.

Am I doing anything wrong in my cache or autowarm settings, or am I
missing anything here?

Thanks,

Johnny