Re: How to list all dynamic fields of a document using solrj?

2011-08-30 Thread Michael Szalay
Hi Juan

I tried with the following code first:

final SolrQuery allDocumentsQuery = new SolrQuery();
allDocumentsQuery.setQuery("id:" + myId);
allDocumentsQuery.setFields("*");
allDocumentsQuery.setRows(1);
QueryResponse response = solr.query(allDocumentsQuery, METHOD.POST);


With this, only non-dynamic fields are returned.
Then I wrote the following helper method:

private Set<String> getDynamicFields() throws SolrServerException, IOException {
    final LukeRequest luke = new LukeRequest();
    luke.setShowSchema(false);
    final LukeResponse process = luke.process(solr);
    final Map<String, FieldInfo> fieldInfo = process.getFieldInfo();
    final Set<String> dynamicFields = new HashSet<String>();
    for (final String key : fieldInfo.keySet()) {
        if (key.endsWith("_string") || key.endsWith("_dateTime")) {
            dynamicFields.add(key);
        }
    }
    return dynamicFields;
}

where "_string" and "_dateTime" are the suffixes of my dynamic fields.
This one returns really all stored fields of the document:

final Set<String> dynamicFields = getDynamicFields();
final SolrQuery allDocumentsQuery = new SolrQuery();
allDocumentsQuery.setQuery("uri:" + myId);
allDocumentsQuery.setFields("*");
for (final String df : dynamicFields) {
    allDocumentsQuery.addField(df);
}

allDocumentsQuery.setRows(1);
QueryResponse response = solr.query(allDocumentsQuery, METHOD.POST);
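
For what it's worth, once the dynamic fields are added to the field list, they come back like any other stored field. A small illustrative sketch (not from the original mail), continuing from the response obtained above, that just prints whatever was returned:

import org.apache.solr.common.SolrDocument;

// Continues from the snippet above: 'response' is the QueryResponse returned by solr.query(...).
final SolrDocument doc = response.getResults().get(0);
for (final String name : doc.getFieldNames()) {
    // Prints every stored field that came back, including the *_string / *_dateTime dynamic ones.
    System.out.println(name + " = " + doc.getFieldValue(name));
}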

Is there a more elegant way to do this? We are using solrj 3.1.0 and solr 3.1.0.

Regards
Michael
--
Michael Szalay
Senior Software Engineer

basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
http://www.basis06.ch - source of smart business

- Original Message -
From: Juan Grande juan.gra...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, 29 August 2011 18:19:05
Subject: Re: How to list all dynamic fields of a document using solrj?

Hi Michael,

It's supposed to work. Can we see a snippet of the code you're using to
retrieve the fields?

*Juan*



On Mon, Aug 29, 2011 at 8:33 AM, Michael Szalay
michael.sza...@basis06.ch wrote:

 Hi all

 how can I list all dynamic fields and their values of a document using
 solrj?
 The dynamic fields are never returned when I use setFields("*").

 Thanks

 Michael

 --
 Michael Szalay
 Senior Software Engineer

 basis06 AG, Birkenweg 61, CH-3013 Bern - Fon +41 31 311 32 22
 http://www.basis06.ch - source of smart business




Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Pranav Prakash
Solr 3.3 has a feature called Grouping. Is it practically the same as deduplication?

Here is my use case for duplicates removal -

We have many documents with similar (up to 99% identical) content. Upon some search
queries, almost all of them come up in the first page of results. Of all these
documents, essentially one is the original and the others are duplicates. We are
able to identify the original content on the basis of a number of factors - who
uploaded it, when, how many viral shares. It is also possible that the
duplicates are uploaded earlier (and hence already exist in the search index) while the
original is uploaded later (and gets added to the index later).

AFAIK, deduplication happens at index time. Is there a way to specify the
original, which should be returned, and the duplicates, which should be kept
from coming up?


*Pranav Prakash*

temet nosce

Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com |
Google http://www.google.com/profiles/pranny


Re: Does Solr flush to disk even before ramBufferSizeMB is hit?

2011-08-30 Thread roz dev
Thanks Shawn.

If Solr writes this info to disk as soon as possible (which is what I am
seeing), then the ramBuffer setting seems to be misleading.

Does anyone else have any thoughts on this?

-Saroj


On Mon, Aug 29, 2011 at 6:14 AM, Shawn Heisey s...@elyograg.org wrote:

 On 8/28/2011 11:18 PM, roz dev wrote:

 I notice that even though InfoStream does not mention that data is being
 flushed to disk, new segment files were created on the server.
 Size of these files kept growing even though there was enough Heap
 available
 and 856MB Ram was not even used.


 With the caveat that I am not an expert and someone may correct me, I'll
 offer this:  It's been my experience that Solr will write the files that
 constitute stored fields as soon as they are available, because that
 information is always the same and nothing will change in those files based
 on the next chunk of data.

 Thanks,
 Shawn




Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Marc Jacobs
Hi all,

Currently I'm testing Solr's indexing performance, but unfortunately I'm
running into memory problems.
It looks like Solr is not closing the filestream after an exception, but I'm
not really sure.

The current system I'm using has 150GB of memory, and while I'm indexing, the
memory consumption keeps growing (eventually to more than 50GB).
In the attached graph I indexed about 70k office documents (PDF, DOC, XLS,
etc.), and between 1 and 2 percent of them throw an exception.
The commits are after 64MB, 60 seconds, or after a job (there are 6 evenly
divided jobs).

After indexing, the memory consumption isn't dropping. Even after an optimize
command it's still there.
What am I doing wrong? I can't imagine I'm the only one with this problem.
Thanks in advance!

Kind regards,

Marc


Re: Solr 3.3. Grouping vs DeDuplication and Deduplication Use Case

2011-08-30 Thread Marc Sturlese
Deduplication uses Lucene's IndexWriter.updateDocument with the signature
term. I don't think it's possible, as a default feature, to choose which
document to index; the original would always have to be the last one indexed.
From the IndexWriter.updateDocument javadoc: "Updates a document by first deleting the document(s) containing term and
then adding the new document. The delete and then add are atomic as seen by
a reader on the same index (flush may happen only after the add)."

With grouping you have all your documents indexed, so it gives you more
flexibility.
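
For reference, a minimal SolrJ sketch of the grouping approach (an illustration only, not from the original mails; the signature field, the group.sort criterion, and the server URL are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupingExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("some search terms");
        // Collapse near-duplicates at query time by grouping on a shared signature field.
        query.set("group", true);
        query.set("group.field", "signature");   // hypothetical field holding the duplicate-detection hash
        query.set("group.limit", 1);             // keep only the top document per group
        // Sort inside each group so your notion of "original" comes first,
        // e.g. by upload date or a popularity score.
        query.set("group.sort", "uploaded_at asc");

        QueryResponse response = solr.query(query);
        System.out.println(response);
    }
}

The per-group sort is what would let you control which of the duplicates is presented as the "original".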

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-3-Grouping-vs-DeDuplication-and-Deduplication-Use-Case-tp3294711p3295023.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: index full text as possible

2011-08-30 Thread Erick Erickson
For phrase queries, you simply surround the text with
double quotes, e.g. "this is a phrase"...

Best
Erick
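
For illustration, the same thing from SolrJ (a tiny sketch; the field name and query text are placeholders):

import org.apache.solr.client.solrj.SolrQuery;

public class PhraseQueryExample {
    public static void main(String[] args) {
        // The escaped double quotes inside the query string make Solr treat the terms as a phrase.
        SolrQuery query = new SolrQuery("text:\"this is a phrase\"");
        System.out.println(query);
    }
}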

2011/8/29 Rode González r...@libnova.es:
 Hi again.

In that case, you should be able to use a tokeniser to split
the input into phrases, though you will probably need to write
a custom tokeniser, depending on what characters you want to
break phrases at. Please see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 I have read this page but I didn't see anything.
 I thought it was a filter implemented.


It is also entirely possible to index the full text, and just do a
phrase search later. This is probably the easiest option, unless
you have a huge volume of text, and the volume of phrases to
be indexed can be significantly lower.

 How can I do that?

 Thanks.
 Rode.





Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread Erick Erickson
Why doesn't the singleton approach we talked about a few
days ago do this? It would create the object the first
time you asked for it, and return you the same one thereafter

Best
Erick

On Mon, Aug 29, 2011 at 11:04 AM, samuele.mattiuzzo samum...@gmail.com wrote:
 It's how I'm doing it now... but I'm not sure I'm placing the objects into
 the right place.

 The significant part of my code is here: http://pastie.org/2448984

 (I've omitted the method implementations since they are pretty long.)

 Inside the method setLocation, I create the connection to the MySQL database.

 Inside the method setFieldPosition, I create the categorization object.

 Then I started thinking I was creating and deleting those objects locally
 every time Solr reads a document to index. So, where should I put them?
 Inside the tothegocustom class constructor, after the super call?

 I'm asking this because I'm not sure if my custom UpdateRequestProcessor is
 created once or once for every document parsed (I'm still learning Solr, but I
 think I'm getting into it, bit by bit!)

 Thanks again!

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3292928.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Geodist

2011-08-30 Thread Erick Erickson
Why couldn't you just give an outrageous distance (10) or something?

You have to have some kind of point you're asking for the distance
*from*, don't you?

Best
Erick

On Mon, Aug 29, 2011 at 5:09 PM, solrnovice manisha...@yahoo.com wrote:
 Eric, thanks for the update. I thought Solr 4.0 should have the pseudo
 columns and that I am using the right version. So did you ever work on a query
 that returns the distance where no long/lat is used in the where
 clause? I mean not a radial search, but a city search, with the
 distance still displayed. My thought was to pass the long and lat to geodist and
 also the coordinates (long and lat) of every record, and let geodist compute
 the distance. Can you please let me know if this worked for you?


 thanks
 SN

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3293779.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr warming when using master/slave replication

2011-08-30 Thread Erick Erickson
"Will traffic be served with a non-warmed index searcher at any
point?"

No. That's what auto-warming is all about.

More correctly, it depends on how you configure things in
your config file. There are entries like firstSearcher,
newSearcher and various autowarm counts, all of which
you set and the various actions specified are carried out
before the switch is made to the new searcher after replication.

There's also useColdSearchers if you want to specifically
NOT wait for warmup. As I said, it depends on how you
configure things.

In fact, this will lead to a temporary increase in memory
use, since the old and new caches will both be in memory for
a short time.

Best
Erick

On Mon, Aug 29, 2011 at 5:54 PM, Mike Austin mike.aus...@juggle.com wrote:
 Correction: Will traffic be served with a non warmed index searcher at any
 point?

 Thanks,
 Mike

 On Mon, Aug 29, 2011 at 4:52 PM, Mike Austin mike.aus...@juggle.com wrote:

 Distribution/Replication gives you a 'new' index on the slave. When Solr
 is told to use the new index, the old caches have to be discarded along with
 the old Index Searcher. That's when autowarming occurs.  If the current
 Index Searcher is serving requests and when a new searcher is opened, the
 new one is 'warmed' while the current one is serving external requests. When
 the new one is ready, it is registered so it can serve any new requests
 while the original one first finishes the requests it is handling. 

 So if warming is configured, the new index will warm before going live?
 How does that work with the copying to the new directory? Does it get warmed
 while in the temp directory before copied over?  My question is basically,
 will traffic be served with a non indexed searcher at any point?

 Thanks,
 Mike


 On Mon, Aug 29, 2011 at 4:45 PM, Rob Casson rob.cas...@gmail.com wrote:

 it's always been my understanding that the caches are discarded, then
 rebuilt/warmed:


 http://wiki.apache.org/solr/SolrCaching#Caching_and_Distribution.2BAC8-Replication

 hth,
 rob

 On Mon, Aug 29, 2011 at 5:30 PM, Mike Austin mike.aus...@juggle.com
 wrote:
  How does warming work when a collection is being distributed to a slave.
  I
  understand that a temp directory is created and it is eventually copied
 to
  the live folder, but what happens to the cache that was built in with
 the
  old index?  Does the cache get rebuilt, can we warm it before it becomes
  live, or can we keep the old cache?
 
  Thanks,
  Mike
 






Re: Solr Faceting DIH

2011-08-30 Thread Erick Erickson
I'd really think carefully before disabling unique IDs. If you do,
you'll have to manage the records yourself, so your next
delta-import will add more records to your search result, even
those that have been updated.

You might do something like make the uniqueKey the
concatenation of productid and attributeid or whatever
makes sense.

Best
Erick

On Mon, Aug 29, 2011 at 5:52 PM, Aaron Bains aaronba...@gmail.com wrote:
 Hello,

 I am trying to set up Solr faceting on products by using the
 DataImportHandler to import data from my database. I have set up my
 data-config.xml with the proper queries and schema.xml with the fields.
 After the import/index is complete I can only search one productid record in
 Solr. For example, of the three productid '10100039' records there are, I am
 only able to search for one of them. Should I somehow disable unique ids?
 What is the best way of doing this?

 Below is the schema I am trying to index:

  +-----------+-------------+---------+------------+
  | productid | attributeid | valueid | categoryid |
  +-----------+-------------+---------+------------+
  |  10100039 |      331100 |    1580 |          1 |
  |  10100039 |      331694 |    1581 |          1 |
  |  10100039 |    33113319 | 1537370 |          1 |
  |  10100040 |      331100 |    1580 |          1 |
  |  10100040 |      331694 | 1540230 |          1 |
  |  10100040 |    33113319 | 1537370 |          1 |
  +-----------+-------------+---------+------------+

 Thanks!



Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Marc Jacobs
Hi all,

Currently I'm testing Solr's indexing performance, but unfortunately I'm
running into memory problems.
It looks like Solr is not closing the filestream after an exception, but I'm
not really sure.

The current system I'm using has 150GB of memory, and while I'm indexing, the
memory consumption keeps growing (eventually to more than 50GB).
In the attached graph (http://postimage.org/image/acyv7kec/) I indexed about
70k office documents (PDF, DOC, XLS, etc.), and between 1 and 2 percent of them throw
an exception.
The commits are after 64MB, 60 seconds, or after a job (there are 6 evenly
divided jobs).

After indexing, the memory consumption isn't dropping. Even after an optimize
command it's still there.
What am I doing wrong? I can't imagine I'm the only one with this problem.
Thanks in advance!

Kind regards,

Marc


Re: Shingle and Query Performance

2011-08-30 Thread Lord Khan Han
Hi Eric,

Fields are lazy loading, content is stored in Solr, and the machine has 32 GB; Solr
has a 20 GB heap. There is no swapping.

As you see, we have many phrases in the same query. I couldn't find a way to
drop QTime to sub-second. Surprisingly, the non-shingled setup tests with better QTime!


On Mon, Aug 29, 2011 at 3:10 PM, Erick Erickson erickerick...@gmail.comwrote:

 Oh, one other thing: have you profiled your machine
 to see if you're swapping? How much memory are
 you giving your JVM? What is the underlying
 hardware setup?

 Best
 Erick

 On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  200K docs and 36G index? It sounds like you're storing
  your documents in the Solr index. In and of itself, that
  shouldn't hurt your query times, *unless* you have
  lazy field loading turned off, have you checked that
  lazy field loading is enabled?
 
 
 
  Best
  Erick
 
  On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han khanuniver...@gmail.com
 wrote:
   Another interesting thing: all one-word or multi-word queries, including
   phrase queries such as "barack obama", are slower in the shingle configuration.
   What am I doing wrong? Without shingles, "barack obama" has a query time of 300 ms;
   with shingles, 780 ms.
 
 
  On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han khanuniver...@gmail.com
 wrote:
 
  Hi,
 
  What is the difference between solr 3.3  and the trunk ?
  I will try 3.3  and let you know the results.
 
 
  Here the search handler:
 
   <requestHandler name="search" class="solr.SearchHandler" default="true">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <!--<str name="fq">category:vv</str>-->
       <str name="fq">mrank:[0 TO 100]</str>
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="defType">edismax</str>
       <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
       <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
       <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
       <str name="pf">content^18.0 m_title^5.0</str>
       <int name="ps">1</int>
       <int name="qs">0</int>
       <str name="mm">2&lt;-25%</str>
       <str name="spellcheck">true</str>
       <!--<str name="spellcheck.collate">true</str>-->
       <str name="spellcheck.count">5</str>
       <str name="spellcheck.dictionary">subobjective</str>
       <str name="spellcheck.onlyMorePopular">false</str>
       <str name="hl.tag.pre">&lt;b&gt;</str>
       <str name="hl.tag.post">&lt;/b&gt;</str>
       <str name="hl.useFastVectorHighlighter">true</str>
     </lst>
 
 
 
 
  On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:
 
  I'm not sure what the issue could be at this point.   I see you've got
  qt=search - what's the definition of that request handler?
 
  What is the parsed query (from the debugQuery response)?
 
  Have you tried this with Solr 3.3 to see if there's any appreciable
  difference?
 
 Erik
 
  On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
 
    With grouping off, the query time goes from 3567 ms to 1912 ms. Grouping
    increases the query time and makes the cache useless. But the same config is
    still faster without shingles.

    We have a head-to-head test this Wednesday against this commercial search
    engine, so I am looking for all suggestions.
  
  
  
   On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher 
 erik.hatc...@gmail.com
  wrote:
  
   Please confirm is this is caused by grouping.  Turn grouping off,
  what's
   query time like?
  
  
   On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
  
    On the other hand, we couldn't use the cache for the query types below. I
    think it's caused by grouping. Anyway, we need to be sub-second without
    cache.
  
  
  
   On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han 
  khanuniver...@gmail.com
   wrote:
  
   Hi,
  
   Thanks for the reply.
  
   Here the solr log capture.:
  
   **
  
  
  
 
 hl.fragsize=100spellcheck=truespellcheck.q=Xgroup.limit=5hl.simple.pre=bhl.fl=contentspellcheck.collate=truewt=javabinhl=truerows=20version=2fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,categoryhl.snippets=3start=0q=%2B+-X+-X+-XX+-XX+-XX+-+-XX+-XXX+-X+-+-+-X+-X+-X+-+-+-X+-XX+-X+-XX+-XX+-+-X+-XX+-+-X+-X+-X+-X+-X+-X+-X+-X+-XX+-XX+-XX+-X+-X+X+X+XX++group.field=hosthl.simple.post=/bgroup=trueqt=searchfq=mrank:[0+TO+100]fq=word_count:[70+TO+*]
   **
  
    is the words. All phrases x  has two words inside.
  
   The timing from the DebugQuery:
  
    <lst name="timing">
      <double name="time">8654.0</double>
      <lst name="prepare">
        <double name="time">16.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent">
          <double name="time">16.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.FacetComponent">
          <double name="time">0.0</double>
        </lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
          <double name="time">0.0</double>
        </lst>
        <lst 

Re: How to send an OpenBitSet object from Solr server?

2011-08-30 Thread Federico Fissore

Satish Talim, on 30/08/2011 05:42, wrote:
[...]


Is there a work-around wherein I can send an OpenBitSet object?



JavaBinCodec (used by default by Solr) supports writing arrays. You can call
getBits() on the OpenBitSet and put them into the binary response.


federico


Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread samuele.mattiuzzo
My problem is I still don't understand where I have to put that singleton (or
how I can load it into Solr).

I have my singleton class Connector for MySQL, with all its methods defined.
Now what? This is the point I'm missing :(

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3295320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to send an OpenBitSet object from Solr server?

2011-08-30 Thread Satish Talim
But how to throw? As a stream of bits?

Satish

On Tue, Aug 30, 2011 at 5:39 PM, Federico Fissore feder...@fissore.orgwrote:

 Satish Talim, on 30/08/2011 05:42, wrote:
 [...]


 Is there a work-around wherein I can send an OpenBitSet object?


 JavaBinCodec (used by default by solr) supports writing arrays. you can
 getBits() from openbitset and throw them into the binary response

 federico



Re: How to send an OpenBitSet object from Solr server?

2011-08-30 Thread Federico Fissore

Satish Talim, on 30/08/2011 14:22, wrote:

But how to throw? As a stream of bits?



getBits() returns a long[].
Add a long[] part to your response:

rb.rsp.add("long_array", obs.getBits())

federico


strField

2011-08-30 Thread Twomey, David

I have a string fieldtype defined like so:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

And I have a field defined as:

<field name="guid" type="string" indexed="true" stored="true" required="false" />

The fields are of this format:
92E8EF8FC9F362BBE0408CA5785A29D4

But in the index they are like this:
<str name="guid">[B@520ed128</str>

I thought it must be compression, but compression=true|false is no longer
supported by StrField.
I don't see any base64 encoding in this field.

Anyone shed light on this?

Thanks



Re: Post Processing Solr Results

2011-08-30 Thread Jamie Johnson
This might work in conjunction with some post-processing to help
pare down the results, but the logic for the actual access to the data
is too complex to have entirely in Solr.
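
For illustration, a minimal SolrJ sketch of the permissions-clause approach Erick describes below (the permissions field, the tier values, and the server URL are assumptions, not from the thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PermissionFilterExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("some search terms");
        // Restrict results to documents carrying at least one of the caller's
        // permission tokens; the "permissions" field would be indexed at update time.
        query.addFilterQuery("permissions:(tier1 OR tier2)");

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound() + " visible documents");
    }
}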

On Mon, Aug 29, 2011 at 2:02 PM, Erick Erickson erickerick...@gmail.com wrote:
 It's reasonable, but post-filtering is often difficult, you have
 too many documents to wade through. If you can see any way
 at all to just include a clause in the query, you'll save a world
 of effort...

 Is there any way you can include a value in some kind of
 permissions field? Let's say you have a document that
 is only to be visible for tier 1 customers. If your permissions
 field contained the tiers (e.g. tier0, tier1), then a simple
 AND permissions:tier1 would do the trick...

 I know this is a trivial example, but you see where this is headed.
 The documents can contain as many of these tokens in permissions
 as you want. As long as you can string together a clause
 like AND permissions:(A OR B OR C) and not have the clause
 get ridiculously long (as in thousands of values), that works best.

 Any such scheme depends upon being able to assign the documents
 some kind of code that doesn't change too often (because when it does
 you have to re-index) and figure out, at query time, what permissions
 a user has.

 Using FieldCache or low-level Lucene routines can answer the question
 "Does doc X contain token Y in field Z?" reasonably easily. What it has
 a hard time doing is answering "For document X, what are all the values
 in the inverted index in field Z?"

 If this doesn't make sense, could you explain a bit more about your
 permissions model?

 Hope this helps
 Erick

 On Mon, Aug 29, 2011 at 11:46 AM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks guys, perhaps I am just going about this the wrong way.  So let
 me explain my problem and perhaps there is a more appropriate
 solution.  What I need to do is basically hide certain results based
 on some passed in user parameter (say their service tier for
 instance).  What I'd like to do is have some way to plugin my custom
 logic to basically remove certain documents from the result set using
 this information.  Now that being said I technically don't need to
 remove the documents from the full result set, I really only need to
 remove them from current page (but still ensuring that a page is
 filled and sorted).  At present I'm trying to see if there is a way
 for me to add this type of logic after the QueryComponent has
 executed, perhaps by going through the DocIdandSet at this point and
 then intersecting the DocIdSet with a DocIdSet which would filter out
 the stuff I don't want seen.  Does this sound reasonable or like a
 fools errand?



 On Mon, Aug 29, 2011 at 10:51 AM, Erik Hatcher erik.hatc...@gmail.com 
 wrote:
 I haven't followed the details, but what I'm guessing you want here is 
 Lucene's FieldCache.  Perhaps something along the lines of how faceting 
 uses it (in SimpleFacets.java) -

   FieldCache.DocTermsIndex si = 
 FieldCache.DEFAULT.getTermsIndex(searcher.getIndexReader(), fieldName);

        Erik

 On Aug 29, 2011, at 09:58 , Erick Erickson wrote:

 If you're asking whether there's a way to find, say,
 all the values for the auth field associated with
 a document... no. The nature of an inverted
 index makes this hard (think of finding all
 the definitions in a dictionary where the word
 earth was in the definition).

 Best
 Erick

 On Mon, Aug 29, 2011 at 9:21 AM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Erick, if I did not know the token up front that could be in
 the index is there not an efficient way to get the field for a
 specific document and do some custom processing on it?

 On Mon, Aug 29, 2011 at 8:34 AM, Erick Erickson erickerick...@gmail.com 
 wrote:
 Start here I think:

 http://lucene.apache.org/java/3_0_2/api/core/index.html?org/apache/lucene/index/TermDocs.html

 Best
 Erick

 On Mon, Aug 29, 2011 at 8:24 AM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks for the reply.  The fields I want are indexed, but how would I
 go directly at the fields I wanted?

 In regards to indexing the auth tokens I've thought about this and am
 trying to get confirmation if that is reasonable given our
 constraints.

 On Mon, Aug 29, 2011 at 8:20 AM, Erick Erickson 
 erickerick...@gmail.com wrote:
 Yeah, loading the document inside a Collector is a
 definite no-no. Have you tried going directly
 at the fields you want (assuming they're
 indexed)? That *should* be much faster, but
 whether it'll be fast enough is a good question. I'm
 thinking some of the Terms methods here. You
 *might* get some joy out of making sure lazy
 field loading is enabled (and make sure the
 fields you're accessing for your logic are
 indexed), but I'm not entirely sure about
 that bit.

 This kind of problem is sometimes handled
 by indexing auth tokens with the documents
 and including an OR clause on the query
 with the authorizations for a particular
 user, but that works best if 

Re: Does Solr flush to disk even before ramBufferSizeMB is hit?

2011-08-30 Thread Shawn Heisey

On 8/30/2011 12:57 AM, roz dev wrote:

Thanks Shawn.

If Solr writes this info to disk as soon as possible (which is what I am
seeing), then the ramBuffer setting seems to be misleading.

Does anyone else have any thoughts on this?


The stored fields are only two of the eleven Lucene files in each 
segment.  The buffer is not needed for them, because there is no 
transformation or data aggregation, they are written continuously as 
data is read.  The other files have to utilize the buffer, and can only 
be written once all the data for that segment has been read, 
transformed, and aggregated.


Thanks,
Shawn



Reading results from FieldCollapsing

2011-08-30 Thread Sowmya V.B.
Hi All

I am trying to use the FieldCollapsing feature in Solr. On the Solr admin
interface, I give ...&group=true&group.field=fieldA and I can see grouped
results.
But I am not able to figure out how to read those results, in that order, in
Java.

Something like SolrDocumentList doclist = response.getResults();
gives me a set of results, over which I iterate, and get something like
doclist.get(1).getFieldValue("title") etc.

After grouping, doing the same step throws an error (apparently because the
returned XML format is different too).

How can I read the group values, and thereby the other field values of the
documents inside each group?

S.
-- 
Sowmya V.B.

Losing optimism is blasphemy!
http://vbsowmya.wordpress.com



Context-Sensitive Spelling Suggestions Collations

2011-08-30 Thread O. Klein
Using the DirectSolrSpellChecker, I'm very interested in this.

According to https://issues.apache.org/jira/browse/SOLR-2585 some changes
need to be made to DirectSolrSpellChecker.

Does anybody know how to get this working?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Context-Sensitive-Spelling-Suggestions-Collations-tp3295570p3295570.html
Sent from the Solr - User mailing list archive at Nabble.com.


escaping special characters does not seem to be escaping in query

2011-08-30 Thread ramdev.wudali
Hi All:
I have a few fields that contain values of the form A:2B or G:U2 and so on. I
would like to be able to search the field using a wildcard search like A:2*
or G:U*. I have tried modifying the field_type definitions to allow for
such queries, but without any luck.

Could someone provide me with a fieldtype that uses the canned
tokenizers and filters and will allow me to do a search as described?

Thanks much

Ramdev 



Re: Solr Geodist

2011-08-30 Thread solrnovice
Hi Eric, thank you for the tip, I will try that option. Where can I find a
document that shows details of the geodist arguments? When I google, I did not
find one.
This is what my query looks like. I want the distance to be returned; I don't
know exactly what to pass to geodist, as I couldn't find a proper document.

http://localhost:/solr/apex_dev/select/?q=city:Quincy&fl=city,state,coordinates,score,geodist(39.9435,-120.9226)

So I want to pass in the long and lat of Quincy, and then I want all the
records which are tagged with Quincy to be returned (as I am doing a
q=city:Quincy search) and also the distance to be displayed. Can you please let
me know what I should pass into geodist() in this scenario.


thanks
SN

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3295606.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Faceting DIH

2011-08-30 Thread Alexei Martchenko
I had the same problem with a database here, and we discovered that every
item had its own product page, its own URL. So we decided that our unique
id had to be the URL instead of using SQL ids and id concatenations.
Sometimes it works. You can store all ids if you need them for something, but
for unique ids, URLs work just fine.

2011/8/30 Erick Erickson erickerick...@gmail.com

 I'd really think carefully before disabling unique IDs. If you do,
 you'll have to manage the records yourself, so your next
 delta-import will add more records to your search result, even
 those that have been updated.

 You might do something like make the uniqueKey the
 concatenation of productid and attributeid or whatever
 makes sense.

 Best
 Erick

 On Mon, Aug 29, 2011 at 5:52 PM, Aaron Bains aaronba...@gmail.com wrote:
  Hello,
 
  I am trying to setup Solr Faceting on products by using the
  DataImportHandler to import data from my database. I have setup my
  data-config.xml with the proper queries and schema.xml with the fields.
  After the import/index is complete I can only search one productid record
 in
  Solr. For example of the three productid '10100039' records there are I
 am
  only able to search for one of those. Should I somehow disable unique
 ids?
  What is the best way of doing this?
 
  Below is the schema I am trying to index:
 
   +-----------+-------------+---------+------------+
   | productid | attributeid | valueid | categoryid |
   +-----------+-------------+---------+------------+
   |  10100039 |      331100 |    1580 |          1 |
   |  10100039 |      331694 |    1581 |          1 |
   |  10100039 |    33113319 | 1537370 |          1 |
   |  10100040 |      331100 |    1580 |          1 |
   |  10100040 |      331694 | 1540230 |          1 |
   |  10100040 |    33113319 | 1537370 |          1 |
   +-----------+-------------+---------+------------+
 
  Thanks!
 




-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533


Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Erick Erickson
What version of Solr are you using, and how are you indexing?
DIH? SolrJ?

I'm guessing you're using Tika, but how?

Best
Erick

On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote:
 Hi all,

 Currently I'm testing Solr's indexing performance, but unfortunately I'm
 running into memory problems.
 It looks like Solr is not closing the filestream after an exception, but I'm
 not really sure.

 The current system I'm using has 150GB of memory, and while I'm indexing, the
 memory consumption keeps growing (eventually to more than 50GB).
 In the attached graph I indexed about 70k office documents (PDF, DOC, XLS,
 etc.), and between 1 and 2 percent of them throw an exception.
 The commits are after 64MB, 60 seconds, or after a job (there are 6 evenly
 divided jobs).

 After indexing, the memory consumption isn't dropping. Even after an optimize
 command it's still there.
 What am I doing wrong? I can't imagine I'm the only one with this problem.
 Thanks in advance!

 Kind regards,

 Marc



Re: Shingle and Query Performance

2011-08-30 Thread Erick Erickson
Can we see the output if you specify both
debugQuery=on&debug=true?

the debug=true will show the time taken up with various
components, which is sometimes surprising...

Second, we never asked the most basic question, what are
you measuring? Is this the QTime of the returned response?
(which is the time actually spent searching) or the time until
the response gets back to the client, which may involve lots besides
searching...

Best
Erick
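
For illustration, a minimal SolrJ sketch for requesting and reading the debug output (the query text and server URL are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DebugTimingExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("some search terms");
        query.set("debugQuery", "on");   // ask Solr to include the parsed query, explain and timing info

        QueryResponse response = solr.query(query);
        // The timing entry shows how long each search component took.
        System.out.println(response.getDebugMap().get("timing"));
        System.out.println("QTime: " + response.getQTime() + " ms");
    }
}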

On Tue, Aug 30, 2011 at 7:59 AM, Lord Khan Han khanuniver...@gmail.com wrote:
 Hi Eric,

 Fields are lazy loading, content is stored in Solr, and the machine has 32 GB; Solr
 has a 20 GB heap. There is no swapping.

 As you see, we have many phrases in the same query. I couldn't find a way to
 drop QTime to sub-second. Surprisingly, the non-shingled setup tests with better QTime!


 On Mon, Aug 29, 2011 at 3:10 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Oh, one other thing: have you profiled your machine
 to see if you're swapping? How much memory are
 you giving your JVM? What is the underlying
 hardware setup?

 Best
 Erick

 On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  200K docs and 36G index? It sounds like you're storing
  your documents in the Solr index. In and of itself, that
  shouldn't hurt your query times, *unless* you have
  lazy field loading turned off, have you checked that
  lazy field loading is enabled?
 
 
 
  Best
  Erick
 
  On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han khanuniver...@gmail.com
 wrote:
   Another interesting thing: all one-word or multi-word queries, including
   phrase queries such as "barack obama", are slower in the shingle configuration.
   What am I doing wrong? Without shingles, "barack obama" has a query time of 300 ms;
   with shingles, 780 ms.
 
 
  On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han khanuniver...@gmail.com
 wrote:
 
  Hi,
 
  What is the difference between solr 3.3  and the trunk ?
  I will try 3.3  and let you know the results.
 
 
  Here the search handler:
 
   <requestHandler name="search" class="solr.SearchHandler" default="true">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <!--<str name="fq">category:vv</str>-->
       <str name="fq">mrank:[0 TO 100]</str>
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="defType">edismax</str>
       <!--<str name="qf">title^0.05 url^1.2 content^1.7 m_title^10.0</str>-->
       <str name="qf">title^1.05 url^1.2 content^1.7 m_title^10.0</str>
       <!-- <str name="bf">recip(ee_score,-0.85,1,0.2)</str> -->
       <str name="pf">content^18.0 m_title^5.0</str>
       <int name="ps">1</int>
       <int name="qs">0</int>
       <str name="mm">2&lt;-25%</str>
       <str name="spellcheck">true</str>
       <!--<str name="spellcheck.collate">true</str>-->
       <str name="spellcheck.count">5</str>
       <str name="spellcheck.dictionary">subobjective</str>
       <str name="spellcheck.onlyMorePopular">false</str>
       <str name="hl.tag.pre">&lt;b&gt;</str>
       <str name="hl.tag.post">&lt;/b&gt;</str>
       <str name="hl.useFastVectorHighlighter">true</str>
     </lst>
 
 
 
 
  On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:
 
  I'm not sure what the issue could be at this point.   I see you've got
  qt=search - what's the definition of that request handler?
 
  What is the parsed query (from the debugQuery response)?
 
  Have you tried this with Solr 3.3 to see if there's any appreciable
  difference?
 
         Erik
 
  On Aug 27, 2011, at 09:34 , Lord Khan Han wrote:
 
    With grouping off, the query time goes from 3567 ms to 1912 ms. Grouping
    increases the query time and makes the cache useless. But the same config is
    still faster without shingles.

    We have a head-to-head test this Wednesday against this commercial search
    engine, so I am looking for all suggestions.
  
  
  
   On Sat, Aug 27, 2011 at 3:37 PM, Erik Hatcher 
 erik.hatc...@gmail.com
  wrote:
  
   Please confirm is this is caused by grouping.  Turn grouping off,
  what's
   query time like?
  
  
   On Aug 27, 2011, at 07:27 , Lord Khan Han wrote:
  
    On the other hand, we couldn't use the cache for the query types below. I
    think it's caused by grouping. Anyway, we need to be sub-second without
    cache.
  
  
  
   On Sat, Aug 27, 2011 at 2:18 PM, Lord Khan Han 
  khanuniver...@gmail.com
   wrote:
  
   Hi,
  
   Thanks for the reply.
  
   Here the solr log capture.:
  
   **
  
  
  
 
 hl.fragsize=100spellcheck=truespellcheck.q=Xgroup.limit=5hl.simple.pre=bhl.fl=contentspellcheck.collate=truewt=javabinhl=truerows=20version=2fl=score,approved,domain,host,id,lang,mimetype,title,tstamp,url,categoryhl.snippets=3start=0q=%2B+-X+-X+-XX+-XX+-XX+-+-XX+-XXX+-X+-+-+-X+-X+-X+-+-+-X+-XX+-X+-XX+-XX+-+-X+-XX+-+-X+-X+-X+-X+-X+-X+-X+-X+-XX+-XX+-XX+-X+-X+X+X+XX++group.field=hosthl.simple.post=/bgroup=trueqt=searchfq=mrank:[0+TO+100]fq=word_count:[70+TO+*]
   **
  
    is the words. All phrases x  

Re: strField

2011-08-30 Thread Erik Hatcher
My educated guess is that you're using Java for your indexer, and you're (or 
something below is) doing a toString on a Java object.  You're sending over a 
Java object address, not the string itself.  A simple change to your indexer 
should fix this.

Erik

On Aug 30, 2011, at 08:42 , Twomey, David wrote:

 
 I have a string fieldtype defined as so
 
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
 
 And I have a field defined as
 
  <field name="guid" type="string" indexed="true" stored="true" required="false" />
 
 The fields are of this format
 92E8EF8FC9F362BBE0408CA5785A29D4
 
 But in the index they are like this:
  <str name="guid">[B@520ed128</str>
 
 I thought it must be compression but compression=true|false is no longer 
 supported by strField
 I don't see any base64 encoding in this field.
 
 Anyone shed light on this?
 
 Thanks
 



Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread Erick Erickson
OK, maybe I'm getting there. You put it into a .jar file, and
then in solrconfig.xml you create a <lib ... /> directive that points
to where the jar file is. At that point, you can add your custom
class to the UpdateRequestProcessor chain as per Tomas' e-mail.

Best
Erick
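
For illustration, a minimal sketch of such a singleton (none of this is from the original mails; the class name, JDBC URL, and credentials are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical singleton holding one shared MySQL connection for the whole core,
// instead of opening a new connection for every document that passes through
// the update processor.
public final class MysqlConnector {

    private static MysqlConnector instance;
    private final Connection connection;

    private MysqlConnector() {
        try {
            // Assumed JDBC URL and credentials; replace with your own.
            connection = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/mydb", "user", "password");
        } catch (SQLException e) {
            throw new RuntimeException("Could not open MySQL connection", e);
        }
    }

    // Lazily created on first use; the same instance is returned thereafter.
    public static synchronized MysqlConnector getInstance() {
        if (instance == null) {
            instance = new MysqlConnector();
        }
        return instance;
    }

    public Connection getConnection() {
        return connection;
    }
}

Packaged into a jar under a directory referenced by a <lib ... /> directive in solrconfig.xml, the custom UpdateRequestProcessor can then call MysqlConnector.getInstance() from its processing methods and reuse the same connection for every document.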

On Tue, Aug 30, 2011 at 8:10 AM, samuele.mattiuzzo samum...@gmail.com wrote:
 my problem is i still don't understand where i have to put that singleton (or
 how i can load it into solr)

 i have my singleton class Connector for mysql, with all its methods defined.
 Now what? This is the point i'm missing :(

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3295320.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr UIMA exception

2011-08-30 Thread chanhangfai
thanks Tommaso,

there is some problem in my solrconfig.xml.
now its fixed.

thanks again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-UIMA-exception-tp3285158p3295743.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Reading results from FieldCollapsing

2011-08-30 Thread Erick Erickson
Have you looked at the XML (or JSON) response format?
You're right, it is different and you have to parse it
differently, there are move levels. Try this query
and you'll see the format (default data set).

http://localhost:8983/solr/select?q=*:*group=ongroup.field=manu_exact


Best
Erick

On Tue, Aug 30, 2011 at 9:25 AM, Sowmya V.B. vbsow...@gmail.com wrote:
 Hi All

 I am trying to use FieldCollapsing feature in Solr. On the Solr admin
 interface, I give ...group=truegroup.field=fieldA and I can see grouped
 results.
 But, I am not able to figure out how to read those results in that order on
 java.

 Something like: SolrDocumentList doclist = response.getResults();
 gives me a set of results, on which I iterate, and get something like
 doclist.get(1).getFieldValue(title) etc.

 After grouping, doing the same step throws me error (apparently, because the
 returned xml formats are different too).

 How can I read groupValues and thereby other fieldvalues of the documents
 inside that group?

 S.
 --
 Sowmya V.B.
 
 Losing optimism is blasphemy!
 http://vbsowmya.wordpress.com
 



Re: escaping special characters does not seem to be escaping in query

2011-08-30 Thread Erick Erickson
There is very little information to go on here, but at a
guess WordDelimiterFilterFactory is your problem.
have you looked at the admin/analysis page to try to figure
out what your analysis chain is doing?

Best
Erick


On Tue, Aug 30, 2011 at 9:46 AM,  ramdev.wud...@thomsonreuters.com wrote:
 Hi All:
    I have a few fields that are of the form: A:2B or G:U2 and so on.  I 
 would like to be able to search the field using a wild character search like: 
   A:2*
 or G:U*. I have tried out modifying the field_type definitions to allow for 
 such queries but without any luck

 Could someone/anyone provided me with a fieldtype that uses the canned 
 Tokenizers and filters which will allow me to do a search as described ?

 Thanks much

 Ramdev




Re: Solr Geodist

2011-08-30 Thread Erick Erickson
q=*:*&sfield=store&pt=45.15,-93.85&fl=name,store,geodist()

Actually, you don't even have to specify the d=, I misunderstood.

Best
Erick
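
For illustration, the same request from SolrJ (a sketch only; the server URL and field names are assumptions, and returning geodist() in fl relies on the pseudo-field support this thread says is only on trunk/4.0):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GeodistExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("city:Quincy");
        query.set("sfield", "coordinates");     // the location field holding each record's point
        query.set("pt", "39.9435,-120.9226");   // the reference point to measure from
        // Ask for geodist() back as a pseudo-field alongside the stored fields.
        query.setFields("city", "state", "coordinates", "score", "geodist()");

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults());
    }
}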

On Tue, Aug 30, 2011 at 9:56 AM, solrnovice manisha...@yahoo.com wrote:
 hi Eric, thank you for the tip, i will try that option. Where can i find a
 document that shows details of geodist arguments, when i google, i did not
 find one.
 so this is what my query is like. I want the distance to be returned. i dont
 now exactly what all to pass to geodist, as i couldnt find a proper
 document.

 http://localhost:/solr/apex_dev/select/?q=city:Quincyfl=city,state,coordinates,score,geodist(39.9435,-120.9226).

 So i want to pass in the long and lat of Quincy and then i want all the
 records which are tagged with Quincy should be returned ( as i am doing a
 q=city:Qunicy search)  and also distance to be displayed. Can you please let
 me know what i should pass into goedist(), in this scenario.


 thanks
 SN

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3295606.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to get update record from database using delta-query?

2011-08-30 Thread Erick Erickson
Have you tried the debug page? See:

http://wiki.apache.org/solr/DataImportHandler#interactive

Best
Erick

On Tue, Aug 30, 2011 at 12:44 AM, vighnesh svighnesh...@gmail.com wrote:
 hi all

 I am facing a problem getting updated records from the database using a delta
 query in Solr. Please give me a solution; my delta query is:

    <entity name="groups_copy" pk="id" dataSource="datasource-1"
            query="select id,name from groups_copy"
            deltaQuery="select id,name from groups_copy where date_created > '${dataimporter.last_index_time}'"
            deltaImportQuery="select id,name from groups_copy where id='${dataimporter.delta.id}'">
      <field column="id" name="id" />
      <field column="name" name="name" />
    </entity>


 Is there anything wrong in this code? Please let me know.

 thanks in advance.

 Regards,
 Vighnesh.

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/how-to-get-update-record-from-database-using-delta-query-tp3294510p3294510.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: strField

2011-08-30 Thread Twomey, David
Hmmm, I'm using DIH defined in data-config.xml

I have an Oracle data source configured using JDBC connect string.




On 8/30/11 10:41 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

My educated guess is that you're using Java for your indexer, and you're
(or something below is) doing a toString on a Java object.  You're
sending over a Java object address, not the string itself.  A simple
change to your indexer should fix this.

Erik

On Aug 30, 2011, at 08:42 , Twomey, David wrote:

 
 I have a string fieldtype defined as so
 
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
 
 And I have a field defined as
 
  <field name="guid" type="string" indexed="true" stored="true" required="false" />
 
 The fields are of this format
 92E8EF8FC9F362BBE0408CA5785A29D4
 
 But in the index they are like this:
  <str name="guid">[B@520ed128</str>
 
 I thought it must be compression but compression=true|false is no
longer supported by strField
 I don't see any base64 encoding in this field.
 
 Anyone shed light on this?
 
 Thanks
 




Relative performance of updating documents of different sizes

2011-08-30 Thread Jeff Leedy
I was curious to know if anyone has any information about the relative
performance of document updates (delete/add operations) on documents
of different sizes. I have a use case in which I can either create
large Solr documents first and subsequently add a small amount of
information to them, or do the opposite (add the small doc first, then
update with the big one.) My guess is that adding smaller ones first
will be faster, since the time to delete a large document is
presumably longer than the time to delete a small one.

Thanks,
Jeff


Re: Reading results from FieldCollapsing

2011-08-30 Thread Sowmya V.B.
Hi Erick

Yes, I did see the XML format. But, I did not understand how to read the
response using SolrJ.

I found some information about the CollapseComponent while googling, which looks
like a normal Solr XML results format:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

However, this class does not seem to exist in Solr 3.3:
org.apache.solr.handler.component.CollapseComponent
was the component mentioned in that link, and it is not in the Solr 3.3
class files.

Sowmya.

On Tue, Aug 30, 2011 at 4:48 PM, Erick Erickson erickerick...@gmail.comwrote:

 Have you looked at the XML (or JSON) response format?
 You're right, it is different and you have to parse it
 differently, there are move levels. Try this query
 and you'll see the format (default data set).

 http://localhost:8983/solr/select?q=*:*group=ongroup.field=manu_exact


 Best
 Erick

 On Tue, Aug 30, 2011 at 9:25 AM, Sowmya V.B. vbsow...@gmail.com wrote:
  Hi All
 
  I am trying to use FieldCollapsing feature in Solr. On the Solr admin
  interface, I give ...group=truegroup.field=fieldA and I can see
 grouped
  results.
  But, I am not able to figure out how to read those results in that order
 on
  java.
 
  Something like: SolrDocumentList doclist = response.getResults();
  gives me a set of results, on which I iterate, and get something like
  doclist.get(1).getFieldValue(title) etc.
 
  After grouping, doing the same step throws me error (apparently, because
 the
  returned xml formats are different too).
 
  How can I read groupValues and thereby other fieldvalues of the documents
  inside that group?
 
  S.
  --
  Sowmya V.B.
  
  Losing optimism is blasphemy!
  http://vbsowmya.wordpress.com
  
 




-- 
Sowmya V.B.

Losing optimism is blasphemy!
http://vbsowmya.wordpress.com



Re: Relative performance of updating documents of different sizes

2011-08-30 Thread Markus Jelsma
Document size should not have any impact on deleting documents, as they are only
marked for deletion.

On Tuesday 30 August 2011 17:06:05 Jeff Leedy wrote:
 I was curious to know if anyone has any information about the relative
 performance of document updates (delete/add operations) on documents
 of different sizes. I have a use case in which I can either create
 large Solr documents first and subsequently add a small amount of
 information to them, or do the opposite (add the small doc first, then
 update with the big one.) My guess is that adding smaller ones first
 will be faster, since the time to delete a small document is
 presumably longer than the time to delete a small one.
 
 Thanks,
 Jeff

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread samuele.mattiuzzo
OK, so my two singleton classes are MysqlConnector and JFJPConnector.

Basically:

1 - jar them
2 - cp them to /custom/path/within/solr/
3 - modify solrconfig.xml with <lib>/custom/path/within/solr/</lib>

My two jars are then automatically loaded? Nice!

In my CustomUpdateProcessor class I can call MysqlConnector.start_query()
and JFJPConnector.other_method(), and it will refer to an active instance of
those 2 classes? Is this how it works, without any other trick around?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3295818.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Reading results from FieldCollapsing

2011-08-30 Thread Erick Erickson
Ahhh, see: https://issues.apache.org/jira/browse/SOLR-2637

Short form: It's in 3.4, not 3.3.

So, your choices are:
1) parse the XML yourself
2) get a current 3.x build (as in one of the nightlies) and use SolrJ there.
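
For illustration, with a 3.4+ (or current 3.x nightly) SolrJ on the classpath, reading the grouped response might look roughly like this (the field names and server URL are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class GroupedResultsExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.set("group", true);
        query.set("group.field", "fieldA");

        QueryResponse response = solr.query(query);
        // With grouping on, the documents live under the group response,
        // not under response.getResults().
        for (GroupCommand command : response.getGroupResponse().getValues()) {
            for (Group group : command.getValues()) {
                System.out.println("group value: " + group.getGroupValue());
                for (SolrDocument doc : group.getResult()) {
                    System.out.println("  title: " + doc.getFieldValue("title"));
                }
            }
        }
    }
}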

Best
Erick

On Tue, Aug 30, 2011 at 11:09 AM, Sowmya V.B. vbsow...@gmail.com wrote:
 Hi Erick

 Yes, I did see the XML format. But, I did not understand how to read the
 response using SolrJ.

 I found some information about Collapse Component on googling, which looks
 like a normal Solr XML results format.
 http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

 However, this class CollapseComponent does not seem to exist in Solr
 3.3. (org.apache.solr.handler.component.CollapseComponent)
 was the component mentioned in that link, which is not there in Solr3.3
 class files.

 Sowmya.

 On Tue, Aug 30, 2011 at 4:48 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Have you looked at the XML (or JSON) response format?
 You're right, it is different and you have to parse it
 differently, there are move levels. Try this query
 and you'll see the format (default data set).

 http://localhost:8983/solr/select?q=*:*group=ongroup.field=manu_exact


 Best
 Erick

 On Tue, Aug 30, 2011 at 9:25 AM, Sowmya V.B. vbsow...@gmail.com wrote:
  Hi All
 
  I am trying to use FieldCollapsing feature in Solr. On the Solr admin
  interface, I give ...group=truegroup.field=fieldA and I can see
 grouped
  results.
  But, I am not able to figure out how to read those results in that order
 on
  java.
 
  Something like: SolrDocumentList doclist = response.getResults();
  gives me a set of results, on which I iterate, and get something like
  doclist.get(1).getFieldValue(title) etc.
 
  After grouping, doing the same step throws me error (apparently, because
 the
  returned xml formats are different too).
 
  How can I read groupValues and thereby other fieldvalues of the documents
  inside that group?
 
  S.
  --
  Sowmya V.B.
  
  Losing optimism is blasphemy!
  http://vbsowmya.wordpress.com
  
 




 --
 Sowmya V.B.
 
 Losing optimism is blasphemy!
 http://vbsowmya.wordpress.com
 



Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread Erick Erickson
Right, you're on track. Note that the  changes you
make to solrconfig.xml require you to give the
qualified class name (e.g. org.myproj.myclass), but
it all just gets found man.

Also, it's not even necessary to be at a custom path within
Solr, although it does have to be *relative* to SOLR_HOME.
I often point it directly to the output directory that my IDE puts
artifacts in. Although the paths get weird, things like
../../../erick/project/out/blahblahblabh

Best
Erick

On Tue, Aug 30, 2011 at 11:14 AM, samuele.mattiuzzo samum...@gmail.com wrote:
 ok so my two singleton classes are MysqlConnector and JFJPConnector

 basically:

 1 - jar them
 2 - cp them to /custom/path/within/solr/
 3 - modify solrconfig.xml with lib/custom/path/within/solr//lib

 my two jars are then automatically loaded? nice!

 in my CustomUpdateProcessor class i can call MysqlConnector.start_query()
 and JFJPConnector.other_method(), and it will refer to an active instance of
 those 2 classes? Is this how it works, without any other trick around?

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3295818.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread samuele.mattiuzzo
I think it's better for me to keep it under some Solr installation path; I
don't want to lose files :)

OK, I'm going to try this out :) I already ran into the package issue
(my.package.whatever); this one I know how to handle!

Thanks for all the help. I'll post again to tell you "It Works!" (but I'm
not sure about it!)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3295842.html
Sent from the Solr - User mailing list archive at Nabble.com.


Document Size for Indexing

2011-08-30 Thread Tirthankar Chatterjee
Hi,

I have a machine (Win 2008 R2) with 16GB RAM, and I am having issues indexing 1/2GB
files. How do we avoid creating a SolrInputDocument, or is there any way to
use the Lucene IndexWriter classes directly?

What would be the best approach? We need some suggestions.

Thanks,
Tirthankar


**Legal Disclaimer***
This communication may contain confidential and privileged
material for the sole use of the intended recipient. Any
unauthorized review, use or distribution by others is strictly
prohibited. If you have received the message in error, please
advise the sender by reply email and delete the message. Thank
you.
*

Re: Solr Geodist

2011-08-30 Thread solrnovice
Eric, thank you for the quick update. So in the query below that you sent me, I
can also add any conditions, right? I mean city:Boston and state:MA, etc.
Can I also use the dismax query syntax?

The confusion from the beginning seems to be the version of Solr I was
trying versus the one you are trying. It looks like the latest trunk of Solr has
geodist, and the one I am using is not returning geodist.


q=*:*&sfield=store&pt=45.15,-93.85&fl=name,store,geodist()



thanks
SN

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3295868.html
Sent from the Solr - User mailing list archive at Nabble.com.


add documents to the slave

2011-08-30 Thread Miguel Valencia

Hi

I've read that it's possible to add documents to the slave machine:

http://wiki.apache.org/solr/SolrReplication#What_if_I_add_documents_to_the_slave_or_if_slave_index_gets_corrupted.3F

Is there any way to disallow adding documents to the slave machine? For
example, touching the configuration files so that only the /select handler is allowed.


Thanks.




Re: strField

2011-08-30 Thread Twomey, David
Ok.  Figured it out.  Thanks for the pointer.  The field was of type RAW
in Oracle so it was being converted to a java string by DIH with the
behaviour below.

I just changed the SQL query in DIH to add RAWTOHEX(guid)



On 8/30/11 11:03 AM, Twomey, David david.two...@novartis.com wrote:

Hmmm, I'm using DIH defined in data-config.xml

I have an Oracle data source configured using JDBC connect string.




On 8/30/11 10:41 AM, Erik Hatcher erik.hatc...@gmail.com wrote:

My educated guess is that you're using Java for your indexer, and you're
(or something below is) doing a toString on a Java object.  You're
sending over a Java object address, not the string itself.  A simple
change to your indexer should fix this.

Erik

On Aug 30, 2011, at 08:42 , Twomey, David wrote:

 
 I have a string fieldtype defined as so
 
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
 
 And I have a field defined as
 
 field name=guid type=string indexed=true stored=true
required=false /
 
 The fields are of this format
 92E8EF8FC9F362BBE0408CA5785A29D4
 
 But in the index they are like this:
 str name=guid[B@520ed128/str
 
 I thought it must be compression but compression=true|false is no
longer supported by strField
 I don't see any base64 encoding in this field.
 
 Anyone shed light on this?
 
 Thanks
 





Re: How to send an OpenBitSet object from Solr server?

2011-08-30 Thread Chris Hostetter

: We have a need to query and fetch millions of document ids from a Solr 3.3
: index and convert the same to a BitSet. To speed things up, we want to
: convert these document ids into OpenBitSet on the server side, put them into
: the response object and read the same on the client side.

This smells like an XY Problem ... what do you intend to do with this 
BitSet on the client side?  the lucene doc ids are meaningless outside of 
hte server, and for any given doc, the id could change from one request to 
the next -- so how would having this data on the clinet be of any use to 
you?

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss


Re: dependency injection in solr

2011-08-30 Thread Federico Fissore

Tomás Fernández Löbbe, on 29/08/2011 20:32, wrote:

You can use reflection to instantiate the correct object (specify the class
name on the parameter on the solrconfig and then invoke the constructor via
reflection). You'll have to manage the life-cycle of your object yourself.
If I understand your requirement, you probably have created a
SearchComponent that uses that retriever, right?




sorry for the delay: I was experimenting.

Raw reflections do not suffice: you can't specify a dependency required 
by your reflection-constructed object


I've ended up using spring this way (will cut some code for brevity)

I've enabled spring the usual way, adding a ContextLoaderListener in 
web.xml and configuring the spring xml or java configuration files (I do 
java configuration)


I've declared a spring bean named myComponentDeclaredInTheSpringConf, 
which is an extension of SearchComponent, together with its collaborators


I've created SpringAwareSearchComponent, that is a delegate of 
SearchComponent


public SpringAwareSearchComponent() {
 this.ctx = ContextLoader.getCurrentWebApplicationContext();
}
...
public void init(NamedList args) {
 super.init(args);
 inner = ctx.getBean(args.get("__beanname__").toString(), 
SearchComponent.class);

 inner.init(args);
}

public void prepare(ResponseBuilder rb) throws IOException {
 inner.prepare(rb);
}

public void process(ResponseBuilder rb) throws IOException {
 inner.process(rb);
}

In solrconfig.xml I've declared the search component as
<searchComponent name="myComponent" class="SpringAwareSearchComponent">
  <str name="__beanname__">myComponentDeclaredInTheSpringConf</str>
  ...other bean specific parameters
</searchComponent>

and added myComponent to the list of search components

And it works like a charm. Maybe I can implement some other solr class 
delegate and add hooks between spring and solr as needed


any comment will be appreciated

best regards

federico


Re: Shingle and Query Performance

2011-08-30 Thread Lord Khan Han
Below is the output of the debug. I am measuring pure Solr QTime, which shows in
the QTime field of the Solr XML response.

<arr name="parsed_filter_queries">
  <str>mrank:[0 TO 100]</str>
</arr>
<lst name="timing">
  <double name="time">8584.0</double>
  <lst name="prepare">
    <double name="time">12.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
      <double name="time">12.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.SpellCheckComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
      <double name="time">0.0</double>
    </lst>
  </lst>
  <lst name="process">
    <double name="time">8572.0</double>
    <lst name="org.apache.solr.handler.component.QueryComponent">
      <double name="time">4480.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.FacetComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.HighlightComponent">
      <double name="time">41.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.StatsComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.SpellCheckComponent">
      <double name="time">0.0</double>
    </lst>
    <lst name="org.apache.solr.handler.component.DebugComponent">
      <double name="time">4051.0</double>
    </lst>

On Tue, Aug 30, 2011 at 5:38 PM, Erick Erickson erickerick...@gmail.comwrote:

 Can we see the output if you specify both
 debugQuery=ondebug=true

 the debug=true will show the time taken up with various
 components, which is sometimes surprising...

 Second, we never asked the most basic question, what are
 you measuring? Is this the QTime of the returned response?
 (which is the time actually spent searching) or the time until
 the response gets back to the client, which may involve lots besides
 searching...

 Best
 Erick

 On Tue, Aug 30, 2011 at 7:59 AM, Lord Khan Han khanuniver...@gmail.com
 wrote:
  Hi Eric,
 
  Fields are lazy loading, content stored in solr and machine 32 gig.. solr
  has 20 gig heap. There is no swapping.
 
  As you see we have many phrases in the same query . I couldnt find a way
 to
  drop qtime to subsecends. Suprisingly non shingled test better qtime !
 
 
  On Mon, Aug 29, 2011 at 3:10 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Oh, one other thing: have you profiled your machine
  to see if you're swapping? How much memory are
  you giving your JVM? What is the underlying
  hardware setup?
 
  Best
  Erick
 
  On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
   200K docs and 36G index? It sounds like you're storing
   your documents in the Solr index. In and of itself, that
   shouldn't hurt your query times, *unless* you have
   lazy field loading turned off, have you checked that
   lazy field loading is enabled?
  
  
  
   Best
   Erick
  
   On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han 
 khanuniver...@gmail.com
  wrote:
   Another insteresting thing is : all one word or more word queries
  including
   phrase queries such as barack obama  slower in shingle
 configuration.
  What
   i am doing wrong ? without shingle barack obama Querytime 300ms
  with
   shingle  780 ms..
  
  
   On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han 
 khanuniver...@gmail.com
  wrote:
  
   Hi,
  
   What is the difference between solr 3.3  and the trunk ?
   I will try 3.3  and let you know the results.
  
  
   Here the search handler:
  
   requestHandler name=search class=solr.SearchHandler
  default=true
lst name=defaults
  str name=echoParamsexplicit/str
  int name=rows10/int
  !--str name=fqcategory:vv/str--
str name=fqmrank:[0 TO 100]/str
  str name=echoParamsexplicit/str
  int name=rows10/int
str name=defTypeedismax/str
  !--str name=qftitle^0.05 url^1.2 content^1.7
   m_title^10.0/str--
   str name=qftitle^1.05 url^1.2 content^1.7 m_title^10.0/str
!-- str name=bfrecip(ee_score,-0.85,1,0.2)/str --
str name=pfcontent^18.0 m_title^5.0/str
int name=ps1/int
int name=qs0/int
str name=mm2lt;-25%/str
str name=spellchecktrue/str
!--str name=spellcheck.collatetrue/str   --
   str name=spellcheck.count5/str
str name=spellcheck.dictionarysubobjective/str
   str name=spellcheck.onlyMorePopularfalse/str
 str name=hl.tag.prelt;bgt;/str
   str name=hl.tag.postlt;/bgt;/str
str name=hl.useFastVectorHighlightertrue/str
/lst
  
  
  
  
   On Sat, Aug 27, 2011 at 5:31 PM, Erik Hatcher 
 erik.hatc...@gmail.com
  wrote:
  
   I'm not sure what the issue could be at this point.   I see you've
 got
   qt=search - what's the definition of that request handler?
  
   What 

Re: add documents to the slave

2011-08-30 Thread simon
That's basically it.

remove all /update URLs from the slave config

On Tue, Aug 30, 2011 at 8:34 AM, Miguel Valencia 
miguel.valen...@juntadeandalucia.es wrote:

 Hi

I've read that it's possible add documents to slave machine:

 http://wiki.apache.org/solr/**SolrReplication#What_if_I_add_**
 documents_to_the_slave_or_if_**slave_index_gets_corrupted.3Fhttp://wiki.apache.org/solr/SolrReplication#What_if_I_add_documents_to_the_slave_or_if_slave_index_gets_corrupted.3F

 ¿Is there anyway to not allow add to documents to slave machine? for
 example, touch on configurations files to only allow handler /select.

 Thanks.





Re: Document Size for Indexing

2011-08-30 Thread simon
What issues exactly?

Are you using 32-bit Java? That will restrict the JVM heap size to roughly 2GB max.

-Simon

On Tue, Aug 30, 2011 at 11:26 AM, Tirthankar Chatterjee 
tchatter...@commvault.com wrote:

 Hi,

 I have a machine (win 2008R2) with 16GB RAM, I am having issue indexing
 1/2GB files. How do we avoid creating a SOLRInputDocument or is there any
 way to directly use Lucene Index writer classes.

 What would be the best approach. We need some suggestions.

 Thanks,
 Tirthankar


 **Legal Disclaimer***
 This communication may contain confidential and privileged
 material for the sole use of the intended recipient. Any
 unauthorized review, use or distribution by others is strictly
 prohibited. If you have received the message in error, please
 advise the sender by reply email and delete the message. Thank
 you.
 *


Re: Solr Geodist

2011-08-30 Thread solrnovice
Eric, can you please let me know the solr build, that you are using. I went
to this below site, but i want to use the same build, you are using, so i
can make sure the queries work.


http://wiki.apache.org/solr/FrontPage#solr_development


thanks
SN



Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Chris Hostetter

: The current system I'm using has 150GB of memory and while I'm indexing the
: memoryconsumption is growing and growing (eventually more then 50GB).
: In the attached graph (http://postimage.org/image/acyv7kec/) I indexed about
: 70k of office-documents (pdf,doc,xls etc) and between 1 and 2 percent throws

Unless i'm misunderstanding something about your graph, only ~12GB of 
memory is used by applications on that machine.  About 60GB is in use by 
the filesystem cache.

The filesystem cache is not memory being used by Solr, it's memory that is 
free and not in use by an application, so your OS is (wisely) using it to 
cache files from disk that you've recently accessed in case you need them 
again.  This is handy, and for maximum efficiency (when keeping your index on 
disk) it's useful to make sure you allocate resources so that you have 
enough extra memory on your server that the entire index can be kept in 
the filesystem cache -- but the OS will happily free up that space for 
other apps that need it if they ask for more memory.

: After indexing the memoryconsumption isn't dropping. Even after an optimize
: command it's still there.

as for why your "Used" memory grows to ~12GB and doesn't decrease even 
after an optimize: that's the way the Java memory model works.  when you 
run the JVM you specify (either explicitly or implicitly via defaults) a 
min & max heap size for the JVM to allocate for itself.  it starts out 
asking the OS for the min, and as it needs more it asks for more up to the 
max.  but (most JVM implementations i know of) don't give back RAM to 
the OS if they don't need it anymore -- they keep it as free space in the 
heap for future object allocation.



-Hoss


Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Marc Jacobs
Hi Erick,

I am using Solr 3.3.0, but I had the same problems with 1.4.1.
The connector is a homemade program in the C# programming language and is
posting via HTTP remote streaming (i.e.
http://localhost:8080/solr/update/extract?stream.file=/path/to/file.doc&literal.id=1
)
I'm using Tika to extract the content (it comes with Solr Cell).

A possible problem is that the filestream needs to be closed by the client
application after extracting, but it seems that something goes wrong when a
Tika exception is thrown: the stream never leaves memory. At least that is
my assumption.

What is the common way to extract content from office files (pdf, doc, rtf,
xls etc.) and index them? To write a content extractor / validator yourself?
Or is it possible to do this with Solr Cell without such huge
memory consumption? Please let me know. Thanks in advance.
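
For reference, a rough SolrJ 3.x sketch of the same extract call, so the client
code never handles the raw stream directly. The file path and literal.id are just
the example values from above, and the core URL is an assumption:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

// Sends one file to Solr Cell; parameters mirror the remote-streaming URL above.
public class ExtractOneFile {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        up.addFile(new File("/path/to/file.doc"));       // the document to extract and index
        up.setParam("literal.id", "1");                  // same literal.id as in the URL above
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit when done
        solr.request(up);
    }
}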

Marc

2011/8/30 Erick Erickson erickerick...@gmail.com

 What version of Solr are you using, and how are you indexing?
 DIH? SolrJ?

 I'm guessing you're using Tika, but how?

 Best
 Erick

 On Tue, Aug 30, 2011 at 4:55 AM, Marc Jacobs jacob...@gmail.com wrote:
  Hi all,
 
  Currently I'm testing Solr's indexing performance, but unfortunately I'm
  running into memory problems.
  It looks like Solr is not closing the filestream after an exception, but
 I'm
  not really sure.
 
  The current system I'm using has 150GB of memory and while I'm indexing
 the
  memoryconsumption is growing and growing (eventually more then 50GB).
  In the attached graph I indexed about 70k of office-documents
 (pdf,doc,xls
  etc) and between 1 and 2 percent throws an exception.
  The commits are after 64MB, 60 seconds or after a job (there are 6 evenly
  divided jobs).
 
  After indexing the memoryconsumption isn't dropping. Even after an
 optimize
  command it's still there.
  What am I doing wrong? I can't imagine I'm the only one with this
 problem.
  Thanks in advance!
 
  Kind regards,
 
  Marc
 



Re: strField

2011-08-30 Thread Chris Hostetter
: Ok.  Figured it out.  Thanks for the pointer.  The field was of type RAW
: in Oracle so it was being converted to a java string by DIH with the
: behaviour below.

RAW is probably very similar to BLOB...

https://wiki.apache.org/solr/DataImportHandlerFaq#Blob_values_in_my_table_are_added_to_the_Solr_document_as_object_strings_like_B.401f23c5


-Hoss


Re: Viewing the complete document from within the index

2011-08-30 Thread karthik
Thanks Everyone for the responses.

Yes, the way Erick described would work for trivial debugging, but when I
actually need to debug something in production this would be a big hassle
;-)

For now I am going to mark the field as stored=true to get around this
problem. We are migrating away from FAST, and FAST has a feature where it can
dump the entire document's content from the index to a txt file.

Thanks again.

On Mon, Aug 29, 2011 at 8:27 AM, Erick Erickson erickerick...@gmail.comwrote:

 You can use Luke to re-construct the doc from
 the indexed terms. It takes a while, because it's
 not a trivial problem, so I'd use a small index for
 verification first If you have Luke show
 you the doc, it'll return stored fields, but as I remember
 there's a button like reconstruct and edit that does
 what you want...

 You can use the TermsComponent to see what's in
 the inverted part of the index, but it doesn't tell
 you which document is associated with the terms,
 so might not help much.

 But it seems you could do this empirically by
 controlling the input to a small set of docs and then
 querying on terms you *know* you didn't have in
 the input but were in the synonyms

 Best
 Erick

 On Mon, Aug 29, 2011 at 3:55 AM, pravesh suyalprav...@yahoo.com wrote:
  Reconstructing the document might not be possible, since,only the stored
  fields are actually stored document-wise(un-inverted), where as the
  indexed-only fields are put as inverted way.
  In don't think SOLR/Lucene currently provides any way, so, one can
  re-construct document in the way you desire. (It's sort of reverse
  engineering not supported)
 
  Thanx
  Pravesh
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/Viewing-the-complete-document-from-within-the-index-tp3288076p3292111.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



Re: Solr Geodist

2011-08-30 Thread solrnovice
I think i found the link to the nightly build, i am going to try this flavor
of solr and run the query and check what happens.
The link i am using is 
https://builds.apache.org/job/Solr-trunk/lastSuccessfulBuild/artifact/artifacts/

thanks
SN



Re: Search by range in multivalued fields

2011-08-30 Thread Chris Hostetter


if you remove the single quotes from your query syntax it should work.

in general, using multivalued fields where you want to coordinate matches 
based on the position in the multivalued field (ie: a multivalued list of 
author first names and a multivalued list of author last names and you want 
any doc where an author is named "john smith") isn't really possible -- 
but in your case you don't seem to really care about coordinating by 
position of the values in the multivalued field, because you have codes 
for the companies as a prefix, so it doesn't matter where in the list it 
is.

If i'm misunderstanding your question, you'll need to explain better 
what docs you want to match, and what docs you *don't* want to match.
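
A minimal SolrJ sketch of the unquoted range query described above, using the
field names from the quoted question below; the core URL is an assumption:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Range query over both multivalued string fields, without single quotes.
public class CompanyRangeQuery {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery(
            "companyinimultivaluefield:[IBM10012005 TO *] AND companyendmultivaluefield:[IBM10012005 TO *]");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound());
    }
}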


: I have a solr core with job records and one guy can work in different
: companies in
: a specific range of dateini to dateend.
: 
: doc  
:   arr name=companyinimultivaluefield
: companyiniIBM10012005companyini
: companyiniAPPLE10012005companyini
:   /arr  
:   arr name=companyendmultivaluefield
: companyendIBM10012005companyend
: companyendAPPLE10012005companyend
:   /arr
:  /doc  
: 
: Is possible to make a range query on a multivalue field over text fields.
: For instance something like that.
: companyinimultivaluefield['IBM10012005' TO *] AND 
: companyendmultivaluefield['IBM10012005' TO *]


-Hoss


Re: Search the contents of given URL in Solr.

2011-08-30 Thread Jayendra Patil
For indexing the webpages, you can use Nutch with Solr, which would do
the scraping and indexing of the page.
For finding similar documents/pages you can use
http://wiki.apache.org/solr/MoreLikeThis, by querying the above
document (by id or search terms) and it would return similar documents
from the index for the result.
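
A minimal SolrJ sketch of that MoreLikeThis approach. The field names (id,
content), the document id and the core URL here are assumptions, not something
your schema necessarily uses:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Ask the standard handler's MoreLikeThisComponent for docs similar to one document.
public class SimilarPages {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("id:42");   // the already-indexed page (hypothetical id)
        q.set("mlt", true);                     // enable the MoreLikeThis component
        q.set("mlt.fl", "content");             // field(s) to mine for "interesting" terms
        q.set("mlt.mintf", 1);
        q.set("mlt.mindf", 1);
        q.set("mlt.count", 10);                 // similar docs to return per result
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse().get("moreLikeThis"));
    }
}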

Regards,
Jayendra

On Tue, Aug 30, 2011 at 8:23 AM, Sheetal rituzprad...@gmail.com wrote:
 Hi,

 Is it possible to give the URL address of a site and solr search server
 reads the contents of the given site and recommends similar projects to
 that. I did scrapped the web contents from the given URL address and now
 have the plain text format of the contents in URL. But when I pass that
 scrapped text as query into Solr. It doesn't work as query being too
 large(depends on the given contents of URL).

 I read it somewhere that its possible , Given the URL address and outputs
 you the relevant projects to it. But I don't remember whether its using Solr
 search or other search engine.

 Does anyone have any ideas or suggestions for this..Would highly appreciate
 your comments

 Thank you in advance..

 -
 Sheetal
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Search-the-contents-of-given-URL-in-Solr-tp3294376p3294376.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Solr 3.3 dismax MM parameter not working properly

2011-08-30 Thread Alexei Martchenko
Anyone else strugglin' with dismax's MM parameter?

We're having a problem here: it seems that configs for 3 terms and more are
being ignored by Solr, and it falls back to the previous configs.

if I use <str name="mm">3&lt;1</str> or <str name="mm">3&lt;100%</str> i get
the same results for a 3-term query.
If i try <str name="mm">4&lt;25%</str> or <str name="mm">4&lt;100%</str> I
also get same data for a 4-term query.

I'm searching: windows service pack
<str name="mm">1&lt;100% 2&lt;50% 3&lt;100%</str> - 13000 results
<str name="mm">1&lt;100% 2&lt;50% 3&lt;1</str> - the same 13000 results
<str name="mm">1&lt;100% 2&lt;50%</str> - very same 13000 results
<str name="mm">1&lt;100% 2&lt;100%</str> - 93 results. seems that here i get
the 33 clause working.
<str name="mm">2&lt;100%</str> - same 93 results, just in case.
<str name="mm">2&lt;50%</str> - very same 13000 results as it should
<str name="mm">2&lt;-50%</str> - 1121 results (weird)

then i tried to control 3-term queries.

<str name="mm">2&lt;-50% 3&lt;100%</str> 1121, the same as 2<-50%, ignoring
the 3 clause.
<str name="mm">2&lt;-50% 3&lt;1</str> the same 1121 results, ignoring again
it.

I'd like to accomplish something like this:
<str name="mm">2&lt;1 3&lt;2 4&lt;3 8&lt;-50%</str>

translating: 1 or 2 - 1 term, 3 at least 2, 4 at least 3 and 5, 6, 7, 8
terms at least half rounded up (5-3, 6-3, 7-4, 8-4)

seems that he's only using 1 and 2 clauses.

thanks in advance

alexei


Re: Search the contents of given URL in Solr.

2011-08-30 Thread Sheetal
Hi Jayendra,

Thank you for the reply. I figured it out finally. I had to configure my web
servlet container Jetty for this. Now it works :-)

-
Sheetal


Re: How to get all the terms in a document as Luke does?

2011-08-30 Thread Jayendra Patil
You might want to check http://wiki.apache.org/solr/TermVectorComponent --
it should provide you with the term vectors plus a lot of additional info.
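
A minimal SolrJ sketch of querying it. This assumes a request handler that
includes the TermVectorComponent (the example solrconfig ships one, usually
named tvrh), a field indexed with termVectors="true", and a hypothetical
document id:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Fetch the indexed terms (with frequencies and positions) for one document.
public class TermsOfOneDoc {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("id:42");   // hypothetical unique key
        q.set("qt", "tvrh");                    // route to the term vector handler
        q.set("tv", true);
        q.set("tv.tf", true);                   // term frequencies
        q.set("tv.df", true);                   // document frequencies
        q.set("tv.positions", true);
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResponse().get("termVectors"));
    }
}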

Regards,
Jayendra

On Tue, Aug 30, 2011 at 3:34 AM, Gabriele Kahlout
gabri...@mysimpatico.com wrote:
 Hello,

 This time I'm trying to duplicate Luke's functionality of knowing which
 terms occur in a search result/document (w/o parsing it again). Any Solrj
 API to do that?

 P.S. I've also posted the question on
 SOhttp://stackoverflow.com/q/7219111/300248
 .

 On Wed, Jul 6, 2011 at 11:09 AM, Gabriele Kahlout
 gabri...@mysimpatico.comwrote:

 From you patch I see TermFreqVector  which provides the information I
 want.

 I also found FieldInvertState.getLength() which seems to be exactly what I
 want. I'm after the word count (sum of tf for every term in the doc). I'm
 just not sure whether FieldInvertState.getLength() returns just the number
 of terms (not multiplied by the frequency of each term - word count) or not
 though. It seems as if it returns word count, but I've not tested it
 sufficienctly.


 On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger 
 the.apache.t...@gmail.comwrote:

 Gabriele,

 I created a patch that does this about a year ago.  See
 https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
 1.4 and is based upon the Document Reconstructor in Luke.  The patch adds
 a
 link to the main solr admin page to a docinspector page which will
 reconstruct the document given a uniqueid (required).  Keep in mind that
 you're only looking at what's in the index for non-stored fields, not
 the
 original text.

 If you have any issues using this on the most recent release, let me know
 and I'd be happy to create a new patch for solr 3.3.  One of these days
 I'll
 remove the JSP dependency and this may eventually making it into trunk.

 Thanks,

 -Trey Grainger
 Search Technology Development Team Lead, Careerbuilder.com
 Site Architect, Celiaccess.com


 On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout
 gabri...@mysimpatico.comwrote:

  Hello,
 
  With an inverted index the term is the key, and the documents are the
  values. Is it still however possible that given a document id I get the
  terms indexed for that document?
 
  --
  Regards,
  K. Gabriele
 
  --- unchanged since 20/9/10 ---
  P.S. If the subject contains [LON] or the addressee acknowledges the
  receipt within 48 hours then I don't resend the email.
  subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
  time(x)
   Now + 48h) ⇒ ¬resend(I, this).
 
  If an email is sent by a sender that is not a trusted contact or the
 email
  does not contain a valid code then the email is not received. A valid
 code
  starts with a hyphen and ends with X.
  ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
  L(-[a-z]+[0-9]X)).
 




 --
 Regards,
 K. Gabriele

 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
 time(x)  Now + 48h) ⇒ ¬resend(I, this).

 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).




 --
 Regards,
 K. Gabriele

 --- unchanged since 20/9/10 ---
 P.S. If the subject contains [LON] or the addressee acknowledges the
 receipt within 48 hours then I don't resend the email.
 subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
  Now + 48h) ⇒ ¬resend(I, this).

 If an email is sent by a sender that is not a trusted contact or the email
 does not contain a valid code then the email is not received. A valid code
 starts with a hyphen and ends with X.
 ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
 L(-[a-z]+[0-9]X)).



Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Marc Jacobs
Hi Chris,

Thanks for the response.
Eventualy I want to install Solr on a machine with a maximum memory of 4GB.
I tried to index the data on that machine before, but it resulted in index
locks and memory errors.
Is 4GB not enough to index 100,000 documents in a row? How much should it
be? Is there a way to tune this?

Regards,

Marc

2011/8/30 Chris Hostetter hossman_luc...@fucit.org


 : The current system I'm using has 150GB of memory and while I'm indexing
 the
 : memoryconsumption is growing and growing (eventually more then 50GB).
 : In the attached graph (http://postimage.org/image/acyv7kec/) I indexed
 about
 : 70k of office-documents (pdf,doc,xls etc) and between 1 and 2 percent
 throws

 Unless i'm missunderstanding sometihng about your graph, only ~12GB of
 memory is used by applications on that machine.  About 60GB is in use by
 the filesystem cache.

 The Filesystem cache is not memory being used by Solr, it's memory that is
 free and not in use by an application, so your OS is (wisely) using it to
 cache files from disk that you've recently accessed in case you need them
 again.  This is handy, and for max efficients (when keeping your index on
 disk) it's useful to make sure you allocate resources so that you have
 enough extra memory on your server that the entire index can be kept in
 the filesystem cache -- but the OS will happily free up that space for
 other apps that need it if they ask for more memory.

 : After indexing the memoryconsumption isn't dropping. Even after an
 optimize
 : command it's still there.

 as for why your Used memory grows to ~12GB and doesn't decrease even
 after an optimize: that's the way the Java memory model works.  whe nyou
 run the JVM you specificy (either explicitly or implicitly via defaults) a
 min  max heap size for hte JVM to allocate for itself.  it starts out
 asking the OS for the min, and as it needs more it asks for more up to the
 max.  but (most JVM implementations i know of) don't give back ram to
 the OS if they don't need it anymore -- they keep it as free space in the
 heap for future object allocation.



 -Hoss



Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread samuele.mattiuzzo
i think i have to drop the singleton class solution, since my boss wants to
add 2 other different solr installations and i need to reuse the plugins i'm
working on... so i'll have to use a connection pool or i will create hangs
when the 3 cores update their indexes at the same time :(



RE: Viewing the complete document from within the index

2011-08-30 Thread Jaeger, Jay - DOT
 I am trying
 to peek into the index to see if my index-time synonym expansions are
 working properly or not.

For this I have successfully used the analysis page of the admin application 
that comes out of the box.  Works really well for debugging schema changes.

JRJ

-Original Message-
From: Paul Libbrecht [mailto:p...@hoplahup.net] 
Sent: Saturday, August 27, 2011 5:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Viewing the complete document from within the index

Karthik,

I sure could be wrong but I never found this.

My search tool implementations (3 thus far, one on solr, all on the web) have 
always proceeded with one tool for experts called something like indexed-view 
which basically remade the indexing process as a dry-run. 

This can also be done with analysis but I have not with solr yet. I personaly 
find it would be nice to have a post servlet within solr that would do exactly 
that: returned the array of indexed token-streams, provided I send it the 
document data. I think you would see what you are looking for below then./

paul



Le 26 août 2011 à 23:40, karthik a écrit :

 Hi Everyone,
 
 I am trying to see whats the best way to view the entire document as its
 indexed within solr/lucene. I have tried to use Luke but it's still showing
 me the fields that i have configured to be returned back [ie., stored=true]
 unless I am not enabling some option in the tool.
 
 Is there a way to see whats actually stored in the index itself? I am trying
 to peek into the index to see if my index-time synonym expansions are
 working properly or not. The field for which I have enabled index-time
 synonym expansion is just used for searching so i have set stored=false.
 
 Thanks



RE: add documents to the slave

2011-08-30 Thread Jaeger, Jay - DOT
Another way that occurs to me is that if you have a security-constraint on the 
update URL(s) in your web.xml, you can map them to no groups / empty groups in 
the JEE container.

JRJ

-Original Message-
From: simon [mailto:mtnes...@gmail.com] 
Sent: Tuesday, August 30, 2011 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: add documents to the slave

That's basically it.

remove all /update URLs from the slave config

On Tue, Aug 30, 2011 at 8:34 AM, Miguel Valencia 
miguel.valen...@juntadeandalucia.es wrote:

 Hi

I've read that it's possible add documents to slave machine:

 http://wiki.apache.org/solr/**SolrReplication#What_if_I_add_**
 documents_to_the_slave_or_if_**slave_index_gets_corrupted.3Fhttp://wiki.apache.org/solr/SolrReplication#What_if_I_add_documents_to_the_slave_or_if_slave_index_gets_corrupted.3F

 ¿Is there anyway to not allow add to documents to slave machine? for
 example, touch on configurations files to only allow handler /select.

 Thanks.





RE: missing field in schema browser on solr admin

2011-08-30 Thread Jaeger, Jay - DOT
Also...  Did he restart either his web app server container or at least the 
Solr servlet inside the container?

JRJ

-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Friday, August 26, 2011 5:29 AM
To: solr-user@lucene.apache.org
Subject: Re: missing field in schema browser on solr admin

Is the field stored?  Do you see it on documents when you do a q=*:* search?

How is that field defined and populated?  (exact config/code needed here)

Erik

On Aug 25, 2011, at 23:07 , deniz wrote:

 hi all...
 
 i have added a new field to index... but now when i check solr admin, i see
 some interesting stuff...
 
 i can see the field in schema and also db config file but there is nothing
 about the field in schema browser... in addition i cant make a search in
 that field... all of the config files seem correct but still no change...
 
 
 any ideas or anyone who has ever had a similar problem?
 
 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/missing-field-in-schema-browser-on-solr-admin-tp3285739p3285739.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 3.3 dismax MM parameter not working properly

2011-08-30 Thread Alexei Martchenko
Hmmm I believe I discovered the problem.

When you have something like this:

2<50% 6<-60%

you should read it from right to left and use the word MORE.

MORE THAN SIX clauses, 60% are optional; MORE THAN TWO clauses (and that
includes 3, 4, 5 AND 6), half is mandatory.

if you want a special rule for 2 terms just add:

1<1 2<50% 6<-60%

MORE THAN ONE clause (i.e. 2) should match 1.

NOW this makes sense!
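
A small sketch of passing the same kind of mm spec at query time with SolrJ
(the query text is just the earlier example; whether you keep mm in
solrconfig.xml or pass it per request is up to you):

import org.apache.solr.client.solrj.SolrQuery;

// Builds a dismax query whose mm spec follows the right-to-left reading above.
public class MmExample {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("windows service pack");
        q.set("defType", "dismax");
        // read right to left: >6 clauses -> 60% may be missing;
        // >2 clauses -> half required; >1 clause (i.e. 2) -> 1 required
        q.set("mm", "1<1 2<50% 6<-60%");
        return q;
    }
}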

2011/8/30 Alexei Martchenko ale...@superdownloads.com.br

 Anyone else strugglin' with dismax's MM parameter?

 We're having a problem here, seems that configs from 3 terms and more are
 being ignored by solr and it assumes previous configs.

 if I use str name=mm3lt;1/str or str name=mm3lt;100%/str i
 get the same results for a 3-term query.
 If i try str name=mm4lt;25%/str or str name=mm4lt;100%/str I
 also get same data for a 4-term query.

 I'm searching: windows service pack
 str name=mm1lt;100% 2lt;50% 3lt;100%/str - 13000 results
 str name=mm1lt;100% 2lt;50% 3lt;1/str - the same 13000 results
 str name=mm1lt;100% 2lt;50%/str - very same 13000 results
 str name=mm1lt;100% 2lt;100%/str - 93 results. seems that here i
 get the 33 clause working.
 str name=mm2lt;100%/str - same 93 results, just in case.
 str name=mm2lt;50%/str - very same 13000 results as it should
 str name=mm2lt;-50%/str - 1121 results (weird)

 then i tried to control 3-term queries.

 str name=mm2lt;-50% 3lt;100%/str 1121, the same as 2-50%, ignoring
 the 3 clause.
 str name=mm2lt;-50% 3lt;1/str the same 1121 results, ignoring again
 it.

 I'd like to accomplish something like this:
 str name=mm2lt;1 3lt;2 4lt;3 8lt;-50%/str

 translating: 1 or 2 - 1 term, 3 at least 2, 4 at least 3 and 5, 6, 7, 8
 terms at least half rounded up (5-3, 6-3, 7-4, 8-4)

 seems that he's only using 1 and 2 clauses.

 thanks in advance

 alexei




-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533


Re: Stream still in memory after tika exception? Possible memoryleak?

2011-08-30 Thread Erick Erickson
See solrconfig.xml, particularly ramBufferSizeMB,
also maxBufferedDocs.

There's no reason you can't index as many documents
as you want, unless your documents are absolutely
huge (as in 100s of M, possibly G size).

Are you actually getting out of memory problems?

Erick

On Tue, Aug 30, 2011 at 4:24 PM, Marc Jacobs jacob...@gmail.com wrote:
 Hi Chris,

 Thanks for the response.
 Eventualy I want to install Solr on a machine with a maximum memory of 4GB.
 I tried to index the data on that machine before, but it resulted in index
 locks and memory errors.
 Is 4GB not enough to index 100,000 documents in a row? How much should it
 be? Is there a way to tune this?

 Regards,

 Marc

 2011/8/30 Chris Hostetter hossman_luc...@fucit.org


 : The current system I'm using has 150GB of memory and while I'm indexing
 the
 : memoryconsumption is growing and growing (eventually more then 50GB).
 : In the attached graph (http://postimage.org/image/acyv7kec/) I indexed
 about
 : 70k of office-documents (pdf,doc,xls etc) and between 1 and 2 percent
 throws

 Unless i'm missunderstanding sometihng about your graph, only ~12GB of
 memory is used by applications on that machine.  About 60GB is in use by
 the filesystem cache.

 The Filesystem cache is not memory being used by Solr, it's memory that is
 free and not in use by an application, so your OS is (wisely) using it to
 cache files from disk that you've recently accessed in case you need them
 again.  This is handy, and for max efficients (when keeping your index on
 disk) it's useful to make sure you allocate resources so that you have
 enough extra memory on your server that the entire index can be kept in
 the filesystem cache -- but the OS will happily free up that space for
 other apps that need it if they ask for more memory.

 : After indexing the memoryconsumption isn't dropping. Even after an
 optimize
 : command it's still there.

 as for why your Used memory grows to ~12GB and doesn't decrease even
 after an optimize: that's the way the Java memory model works.  whe nyou
 run the JVM you specificy (either explicitly or implicitly via defaults) a
 min  max heap size for hte JVM to allocate for itself.  it starts out
 asking the OS for the min, and as it needs more it asks for more up to the
 max.  but (most JVM implementations i know of) don't give back ram to
 the OS if they don't need it anymore -- they keep it as free space in the
 heap for future object allocation.



 -Hoss




Re: Solr custom plugins: is it possible to have them persistent?

2011-08-30 Thread Erick Erickson
Well, your singleton can be the connection
pool manager..
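
A minimal sketch of that idea: a singleton that owns a tiny JDBC pool shared by
the plugins across cores. The pool size, JDBC URL and credentials are
assumptions, not anything prescribed by Solr:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Singleton pool manager: each plugin borrows a connection and gives it back.
public final class PoolHolder {
    private static final PoolHolder INSTANCE = new PoolHolder();
    private final BlockingQueue<Connection> pool = new ArrayBlockingQueue<Connection>(10);

    private PoolHolder() {
        try {
            for (int i = 0; i < 10; i++) {   // assumed pool size
                pool.add(DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass"));
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    public static PoolHolder getInstance() { return INSTANCE; }

    // blocks until a connection is free, so concurrent cores wait instead of failing
    public Connection borrow() throws InterruptedException { return pool.take(); }

    public void giveBack(Connection c) { pool.offer(c); }
}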

Best
Erick

On Tue, Aug 30, 2011 at 4:45 PM, samuele.mattiuzzo samum...@gmail.com wrote:
 i thinki i have to drop the singleton class solution, since my boss wants to
 add 2 other different solr installation and i need to reuse the plugins i'm
 working on... so i'll have to use a connectionpool or i will create hangs
 when the 3 cores update their indexes at the same time :(

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-custom-plugins-is-it-possible-to-have-them-persistent-tp3292781p3296627.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 3.3 dismax MM parameter not working properly

2011-08-30 Thread Erick Erickson
Yep, that one takes a while to figure out, then
I wind up re-figuring it out every time I have
to change it G...

Best
Erick

On Tue, Aug 30, 2011 at 6:36 PM, Alexei Martchenko
ale...@superdownloads.com.br wrote:
 Hmmm I believe I discovered the problem.

 When you have something like this:

 250% 6-60%

 you should read it from right to left and use the word MORE.

 MORE THAN SIX clauses, 60% are optional, MORE THAN TWO clauses (and that
 includes 3, 4 and 5 AND 6) half is mandatory.

 if you wanna a special rule for 2 terms just add:

 11 250% 6-60%

 MORE THAN ONE clauses (2) should match 1.

 NOW this makes sense!

 2011/8/30 Alexei Martchenko ale...@superdownloads.com.br

 Anyone else strugglin' with dismax's MM parameter?

 We're having a problem here, seems that configs from 3 terms and more are
 being ignored by solr and it assumes previous configs.

 if I use str name=mm3lt;1/str or str name=mm3lt;100%/str i
 get the same results for a 3-term query.
 If i try str name=mm4lt;25%/str or str name=mm4lt;100%/str I
 also get same data for a 4-term query.

 I'm searching: windows service pack
 str name=mm1lt;100% 2lt;50% 3lt;100%/str - 13000 results
 str name=mm1lt;100% 2lt;50% 3lt;1/str - the same 13000 results
 str name=mm1lt;100% 2lt;50%/str - very same 13000 results
 str name=mm1lt;100% 2lt;100%/str - 93 results. seems that here i
 get the 33 clause working.
 str name=mm2lt;100%/str - same 93 results, just in case.
 str name=mm2lt;50%/str - very same 13000 results as it should
 str name=mm2lt;-50%/str - 1121 results (weird)

 then i tried to control 3-term queries.

 str name=mm2lt;-50% 3lt;100%/str 1121, the same as 2-50%, ignoring
 the 3 clause.
 str name=mm2lt;-50% 3lt;1/str the same 1121 results, ignoring again
 it.

 I'd like to accomplish something like this:
 str name=mm2lt;1 3lt;2 4lt;3 8lt;-50%/str

 translating: 1 or 2 - 1 term, 3 at least 2, 4 at least 3 and 5, 6, 7, 8
 terms at least half rounded up (5-3, 6-3, 7-4, 8-4)

 seems that he's only using 1 and 2 clauses.

 thanks in advance

 alexei




 --

 *Alexei Martchenko* | *CEO* | Superdownloads
 ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
 5083.1018/5080.3535/5080.3533



Re: Shingle and Query Performance

2011-08-30 Thread Erick Erickson
OK, I'll have to defer because this makes no sense.
4+ seconds in the debug component?

Sorry I can't be more help here, but nothing really
jumps out.
Erick

On Tue, Aug 30, 2011 at 12:45 PM, Lord Khan Han khanuniver...@gmail.com wrote:
 Below the output of the debug. I am measuring pure solr qtime which show in
 the Qtime field in solr xml.

 arr name=parsed_filter_queries
 strmrank:[0 TO 100]/str
 /arr
 lst name=timing
 double name=time8584.0/double
 lst name=prepare
 double name=time12.0/double
 lst name=org.apache.solr.handler.component.QueryComponent
 double name=time12.0/double
 /lst
 lst name=org.apache.solr.handler.component.FacetComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.MoreLikeThisComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.HighlightComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.StatsComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.SpellCheckComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.DebugComponent
 double name=time0.0/double
 /lst
 /lst
 lst name=process
 double name=time8572.0/double
 lst name=org.apache.solr.handler.component.QueryComponent
 double name=time4480.0/double
 /lst
 lst name=org.apache.solr.handler.component.FacetComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.MoreLikeThisComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.HighlightComponent
 double name=time41.0/double
 /lst
 lst name=org.apache.solr.handler.component.StatsComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.SpellCheckComponent
 double name=time0.0/double
 /lst
 lst name=org.apache.solr.handler.component.DebugComponent
 double name=time4051.0/double
 /lst

 On Tue, Aug 30, 2011 at 5:38 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Can we see the output if you specify both
 debugQuery=ondebug=true

 the debug=true will show the time taken up with various
 components, which is sometimes surprising...

 Second, we never asked the most basic question, what are
 you measuring? Is this the QTime of the returned response?
 (which is the time actually spent searching) or the time until
 the response gets back to the client, which may involve lots besides
 searching...

 Best
 Erick

 On Tue, Aug 30, 2011 at 7:59 AM, Lord Khan Han khanuniver...@gmail.com
 wrote:
  Hi Eric,
 
  Fields are lazy loading, content stored in solr and machine 32 gig.. solr
  has 20 gig heap. There is no swapping.
 
  As you see we have many phrases in the same query . I couldnt find a way
 to
  drop qtime to subsecends. Suprisingly non shingled test better qtime !
 
 
  On Mon, Aug 29, 2011 at 3:10 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Oh, one other thing: have you profiled your machine
  to see if you're swapping? How much memory are
  you giving your JVM? What is the underlying
  hardware setup?
 
  Best
  Erick
 
  On Mon, Aug 29, 2011 at 8:09 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
   200K docs and 36G index? It sounds like you're storing
   your documents in the Solr index. In and of itself, that
   shouldn't hurt your query times, *unless* you have
   lazy field loading turned off, have you checked that
   lazy field loading is enabled?
  
  
  
   Best
   Erick
  
   On Sun, Aug 28, 2011 at 5:30 AM, Lord Khan Han 
 khanuniver...@gmail.com
  wrote:
   Another insteresting thing is : all one word or more word queries
  including
   phrase queries such as barack obama  slower in shingle
 configuration.
  What
   i am doing wrong ? without shingle barack obama Querytime 300ms
  with
   shingle  780 ms..
  
  
   On Sat, Aug 27, 2011 at 7:58 PM, Lord Khan Han 
 khanuniver...@gmail.com
  wrote:
  
   Hi,
  
   What is the difference between solr 3.3  and the trunk ?
   I will try 3.3  and let you know the results.
  
  
   Here the search handler:
  
   requestHandler name=search class=solr.SearchHandler
  default=true
        lst name=defaults
          str name=echoParamsexplicit/str
          int name=rows10/int
          !--str name=fqcategory:vv/str--
    str name=fqmrank:[0 TO 100]/str
          str name=echoParamsexplicit/str
          int name=rows10/int
    str name=defTypeedismax/str
          !--str name=qftitle^0.05 url^1.2 content^1.7
   m_title^10.0/str--
   str name=qftitle^1.05 url^1.2 content^1.7 m_title^10.0/str
    !-- str name=bfrecip(ee_score,-0.85,1,0.2)/str --
    str name=pfcontent^18.0 m_title^5.0/str
    int name=ps1/int
    int name=qs0/int
    str name=mm2lt;-25%/str
    str name=spellchecktrue/str
    !--str name=spellcheck.collatetrue/str   --
   str name=spellcheck.count5/str
    str name=spellcheck.dictionarysubobjective/str
   str name=spellcheck.onlyMorePopularfalse/str
     str name=hl.tag.prelt;bgt;/str
   str name=hl.tag.postlt;/bgt;/str
    str 

Re: Solr Geodist

2011-08-30 Thread Erick Erickson
That should be fine. I'm not actually sure what version of Trunk I
have, I update it sporadically and build from scratch. But the last
successful build artifacts will certainly have the pseudo-field
return of function in it, so you should be fine.

Best
Erick

On Tue, Aug 30, 2011 at 2:33 PM, solrnovice manisha...@yahoo.com wrote:
 I think i found the link to the nightly build, i am going to try this flavor
 of solr and run the query and check what happens.
 The link i am using is
 https://builds.apache.org/job/Solr-trunk/lastSuccessfulBuild/artifact/artifacts/

 thanks
 SN

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3296316.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Geodist

2011-08-30 Thread solrnovice
Hi Erick, today I got the distance working. Since the Solr version under
LucidImagination is not returning geodist(), I downloaded Solr 4.0 from the
nightly build. On Lucid we had the full schema defined, so I copied that
schema to the example directory of Solr 4, removed all references to
Lucid, and started the index.
I wanted to try our schema under Solr 4.

Then I had the data indexed (we have a rake task written in Ruby to index the
contents) and ran the geodist queries, and they all run like a charm. I do
get distance as a pseudo column.

Is there any documentation that gives me all the arguments of geodist()? I
couldn't find it online.


Erick, thanks for your help in going through my examples. Now they all work
on my Solr installation.
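
In case it helps anyone else, a minimal SolrJ sketch of the kind of request
discussed in this thread, with the sfield/pt values from the earlier example
(returning geodist() as a pseudo-field needs a trunk/4.0 build, as noted
above; the core URL is an assumption):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Query everything, return the distance from a fixed point, nearest first.
public class GeodistQuery {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.set("sfield", "store");                   // the LatLon field to measure from
        q.set("pt", "45.15,-93.85");                // the reference point
        q.setFields("name", "store", "geodist()");  // geodist() comes back as a pseudo-field
        q.set("sort", "geodist() asc");             // nearest first
        System.out.println(solr.query(q).getResults());
    }
}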


thanks
SN



Why I can't take an full-import with entity name?

2011-08-30 Thread 于浩
I am using Solr 1.3, and I update the Solr index through a delta-import every two
hours, but the delta-import is wasteful of database connections.
So I want to use a full-import with an entity name instead of the delta-import.

my db-data-config.xml  file:
<entity name="article" pk="Article_ID" query="select
    Article_ID,Article_Title,Article_Abstract from Article_Detail">
    <field name="Article_ID" column="Article_ID" />
</entity>
<entity name="delta_article" pk="Article_ID" rootEngity="false"
    query="select Article_ID,Article_Title,Article_Abstract from Article_Detail
    where Article_ID &gt; '${dataimporter.request.minID}' and Article_ID &lt;= '{dataimporter.request.maxID}'">

    <field name="Article_ID" column="Article_ID" />
</entity>


Then I request
http://192.168.1.98:8081/solr/db_article/dataimport?command=full-import&entity=delta_article&commit=true&clean=false&maxID=1000&minID=10
but Solr finishes nearly instantly and no records are imported, even though
in fact there are many records that meet the maxID and minID condition.


the tomcat log:
INFO: [db_article] webapp=/solr path=/dataimport
params={maxID=6737277&clean=false&commit=true&entity=delta_article&command=full-import&minID=6736841}
status=0 QTime=0
2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DataImporter
doFullImport
INFO: Starting Full Import
2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
INFO: Read dataimport.properties
2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter
persistStartTime
INFO: Wrote last indexed time to dataimport.properties
2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DocBuilder commit
INFO: Full Import completed successfully


Can somebody help, or offer some advice?


Re: Solr Geodist

2011-08-30 Thread Lance Norskog
Lucid also has an online forum for questions about the LucidWorks Enterprise
product:

http://www.lucidimagination.com/forum/lwe

The Lucid Imagination engineers all read the forum and endeavor to quickly
answer questions like this.

On Tue, Aug 30, 2011 at 6:09 PM, solrnovice manisha...@yahoo.com wrote:

 hi Erik, today i had the distance working. Since the solr version under
 LucidImagination is not returning geodist(),  I downloaded Solr 4.0 from
 the
 nightly build. On lucid we had the full schema defined. So i copied that
 schema to the example directory of solr-4 and removed all references to
 Lucid and started the index.
 I wanted to try our schema under solr-4.

 Then i had the data indexed ( we have a rake written in ruby to index the
 contents) and ran the geodist queries and they all run like a charm. I do
 get distance as a pseudo column.

 Is there any documentation that gives me all the arguments of geodist(), i
 couldnt find it online.


 Erick, thanks for your help in going through my examples. NOw they all work
 on my solr installation.


 thanks
 SN

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Geodist-tp3287005p3297088.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Changing the DocCollector

2011-08-30 Thread Jamie Johnson
So I looked at doing this, but I don't see a way to get the scores
from the docs as well.  Am I missing something in that regard?
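
For anyone following along, a small sketch of the kind of loop that can read
scores inside a custom SearchComponent.process(). The scores are only filled in
when they were requested (e.g. fl=*,score), which is an assumption here:

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;

// Walks the DocList produced by the QueryComponent and reads each docid + score.
public class DocListScores {
    static void dumpScores(ResponseBuilder rb) {
        DocList docs = rb.getResults().docList;
        DocIterator it = docs.iterator();
        while (it.hasNext()) {
            int docId = it.nextDoc();   // internal Lucene docid
            float score = it.score();   // score of the doc just returned by nextDoc()
            // custom filtering / bookkeeping for (docId, score) goes here
        }
    }
}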

On Mon, Aug 29, 2011 at 8:53 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Hoss.  I am actually ok with that, I think something like
 50,000 results from each shard as a max would be reasonable since my
 check takes about 1s for 50,000 records.  I'll give this a whirl and
 see how it goes.

 On Mon, Aug 29, 2011 at 6:46 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : Also I see that this is before sorting, is there a way to do something
 : similar after sorting?  The reason is that I'm ok with the total
 : result not being completely accurate so long as the first say 10 pages
 : are accurate.  The results could get more accurate as you page through
 : them though.  Does that make sense?

 munging results after sorting is dangerous in the general case, but if you
 have a specific usecase where you're okay with only garunteeing accurate
 results up to result #X, then you might be able to get away with something
 like...

 * custom SearchComponent
 * configure to run after QueryComponent
 * in prepare, record the start  rows params, and replace them with 0 
 (MAX_PAGE_NUM * rows)
 * in process, iterate over the the DocList and build up your own new
 DocSlice based on the docs that match your special criteria - then use the
 original start/rows to generate a subset and return that

 ...getting this to play nicely with stuff like faceting be possible with
 more work, and manipulation of the DocSet (assuming you're okay with the
 facet counts only being as accurate as much as the DocList is -- filtered
 up to row X).

 it could fail misserablly with distributed search since you hvae no idea
 how many results will pass your filter.

 (note: this is all off the top of my head ... no idea if it would actually
 work)



 -Hoss




Re: How to send an OpenBitSet object from Solr server?

2011-08-30 Thread Satish Talim
I was not referring to Lucene's doc ids but the doc numbers (unique key)

Satish



On Tue, Aug 30, 2011 at 9:28 PM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : We have a need to query and fetch millions of document ids from a Solr
 3.3
 : index and convert the same to a BitSet. To speed things up, we want to
 : convert these document ids into OpenBitSet on the server side, put them
 into
 : the response object and read the same on the client side.

 This smells like an XY Problem ... what do you intend to do with this
 BitSet on the client side?  the lucene doc ids are meaningless outside of
 hte server, and for any given doc, the id could change from one request to
 the next -- so how would having this data on the clinet be of any use to
 you?

 https://people.apache.org/~hossman/#xyproblem
 XY Problem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341


 -Hoss



solrCloud

2011-08-30 Thread 陈葵
Hi all:

I'm using SolrCloud for distributed search. Everything works well, but there is
a small problem. Each node searches quickly and passes its data up to the
top-level request, but I found that a request like
q=solr&ids=id1,id2,id3,id4,id5,id6...,id10
is handled by Solr with a 'for' loop. Each id costs 300ms or so. If there
are 100 per page, the cost won't be tolerable.

Also, I'm studying SolrCloud hard because of the lack of documentation. I
have only found http://wiki.apache.org/solr/SolrCloud. Are there any others?

Hope for your answer.


Re: Solr Geodist

2011-08-30 Thread solrnovice
Hi Lance, thanks for the link. I went to their site, the Lucid Imagination
forum, and when I searched on geodist I see my own posts. Is this forum part of
Lucid Imagination?

Just curious.

thanks
SN



Re: Changing the DocCollector

2011-08-30 Thread Jamie Johnson
Found score, so this works for regular queries but now I'm getting an
exception when faceting.

SEVERE: Exception during facet.field of type:java.lang.NullPointerException
at 
org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:451)
at 
org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:313)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:357)
at 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:191)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:81)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1290)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Any insight into what would cause that?

On Tue, Aug 30, 2011 at 10:13 PM, Jamie Johnson jej2...@gmail.com wrote:
 So I looked at doing this, but I don't see a way to get the scores
 from the docs as well.  Am I missing something in that regard?

 On Mon, Aug 29, 2011 at 8:53 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks Hoss.  I am actually ok with that, I think something like
 50,000 results from each shard as a max would be reasonable since my
 check takes about 1s for 50,000 records.  I'll give this a whirl and
 see how it goes.

 On Mon, Aug 29, 2011 at 6:46 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : Also I see that this is before sorting, is there a way to do something
 : similar after sorting?  The reason is that I'm ok with the total
 : result not being completely accurate so long as the first say 10 pages
 : are accurate.  The results could get more accurate as you page through
 : them though.  Does that make sense?

 munging results after sorting is dangerous in the general case, but if you
 have a specific use case where you're okay with only guaranteeing accurate
 results up to result #X, then you might be able to get away with something
 like...

 * custom SearchComponent
 * configure to run after QueryComponent
 * in prepare, record the start & rows params, and replace them with 0 &
 (MAX_PAGE_NUM * rows)
 * in process, iterate over the DocList and build up your own new
 DocSlice based on the docs that match your special criteria - then use the
 original start/rows to generate a subset and return that

 ...getting this to play nicely with stuff like faceting should be possible
 with more work, and manipulation of the DocSet (assuming you're okay with the
 facet counts only being as accurate as the DocList is -- filtered
 up to row X).

 it could fail miserably with distributed search since you have no idea
 how many results will pass your filter.

 (note: this is all off the top of my head ... no idea if it would actually
 work)



 -Hoss
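
For anyone who wants to try this, below is a rough, untested sketch of the
component Hoss outlines, written against the Solr 3.x SearchComponent API. The
class name, MAX_PAGE_NUM, the "origStart"/"origRows" context keys and the
accept() check are made-up names for illustration; accept() is where your own
per-document criteria would go.

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.SortSpec;

public class FilteredPagingComponent extends SearchComponent {

  // hypothetical cap: results are only guaranteed accurate up to this many pages
  private static final int MAX_PAGE_NUM = 10;

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // QueryComponent.prepare has already copied start/rows into the SortSpec,
    // so widen the window there rather than in the request params
    SortSpec sortSpec = rb.getSortSpec();
    int start = sortSpec.getOffset();
    int rows  = sortSpec.getCount();
    rb.req.getContext().put("origStart", start);
    rb.req.getContext().put("origRows", rows);
    sortSpec.setOffset(0);
    sortSpec.setCount(MAX_PAGE_NUM * rows);

    // make sure scores are collected so they can be copied into the new DocSlice
    rb.setFieldFlags(rb.getFieldFlags() | SolrIndexSearcher.GET_SCORES);
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    DocList full = rb.getResults().docList;

    int[] docs = new int[full.size()];
    float[] scores = new float[full.size()];
    int kept = 0;
    float maxScore = 0f;

    // keep only documents that pass the custom check, preserving sort order
    DocIterator it = full.iterator();
    while (it.hasNext()) {
      int docId = it.nextDoc();
      float score = it.score();
      if (accept(rb, docId)) {
        docs[kept] = docId;
        scores[kept] = score;
        maxScore = Math.max(maxScore, score);
        kept++;
      }
    }

    int start = (Integer) rb.req.getContext().get("origStart");
    int rows  = (Integer) rb.req.getContext().get("origRows");

    // rebuild a DocList from the survivors and cut it back to the caller's page;
    // note that faceting works off rb.getResults().docSet, which is untouched here
    DocSlice filtered = new DocSlice(0, kept, docs, scores, kept, maxScore);
    rb.getResults().docList = filtered.subset(start, rows);
  }

  // placeholder for the per-document criteria ("your special criteria" above)
  private boolean accept(ResponseBuilder rb, int docId) throws IOException {
    return true;
  }

  // SolrInfoMBean plumbing required by the 3.x SearchComponent base class
  public String getDescription() { return "filters the DocList produced by QueryComponent"; }
  public String getSource()      { return ""; }
  public String getSourceId()    { return ""; }
  public String getVersion()     { return "1.0"; }
}

It would be registered as a searchComponent in solrconfig.xml and listed in the
handler's component chain after the query component (for example via
last-components). As Hoss notes, faceting works off the DocSet rather than the
DocList, so facet counts are only as accurate as whatever filtering is (or is
not) applied there.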





Duplication of Output

2011-08-30 Thread Aaron Bains
Hello,

What is the best way to remove duplicate values from the output? I am using the
following query:

/solr/select/?q=wrt54g2&version=2.2&start=0&rows=10&indent=on&fl=productid

And I get the following results:

<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1013033708</int></doc>
<doc><int name="productid">1013033708</int></doc>
<doc><int name="productid">1013033708</int></doc>


But I don't want those results because there are duplicates. I am looking
for results like below:

<doc><int name="productid">1011630553</int></doc>
<doc><int name="productid">1013033708</int></doc>

I know there is deduplication and field collapsing but I am not sure if they
are applicable in this situation. Thanks for your help!
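
As an aside, the result grouping (field collapsing) added in Solr 3.3 may fit
this situation: grouping on productid and keeping one document per group
collapses the duplicates at query time. A rough SolrJ sketch, assuming a
SolrServer instance named solr and that productid is a single-valued indexed
field:

final SolrQuery query = new SolrQuery();
query.setQuery("wrt54g2");
query.setFields("productid");
query.setRows(10);
// Solr 3.3 result grouping (field collapsing)
query.set("group", true);
query.set("group.field", "productid");   // collapse on the duplicated field
query.set("group.limit", 1);             // keep one document per productid value
QueryResponse response = solr.query(query, METHOD.POST);

Note that the response then comes back in a grouped section rather than the
flat result list, so it has to be read accordingly. Deduplication (the
SignatureUpdateProcessorFactory) works at index time instead, collapsing
documents whose computed signature is identical.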


Re: How to get all the terms in a document as Luke does?

2011-08-30 Thread Gabriele Kahlout
The Term Vector Component (TVC) is a SearchComponent designed to return
information about documents that is stored when setting the termVector
attribute on a field:

Will I have to re-index after adding that to the schema?
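
For reference, term vectors are created at index time, so documents indexed
before the termVector attributes were enabled won't have them; those documents
would need to be re-indexed. A hedged sketch of the setup the wiki describes
(the field name, type and handler name below are only examples, not taken from
any real schema):

<!-- schema.xml: store term vectors for the field (applies to newly indexed docs) -->
<field name="content" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- solrconfig.xml: a handler wired up with the TermVectorComponent -->
<searchComponent name="tvComponent"
                 class="org.apache.solr.handler.component.TermVectorComponent"/>
<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

A query against the /tvrh handler (optionally with tv.tf=true, tv.positions=true
and so on) then returns term vectors for the matching documents; as far as I
know SolrJ has no typed accessor for them, so they have to be read out of
QueryResponse.getResponse() under the "termVectors" key.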

On Tue, Aug 30, 2011 at 11:06 PM, Jayendra Patil 
jayendra.patil@gmail.com wrote:

 you might want to check - http://wiki.apache.org/solr/TermVectorComponent
 Should provide you with the term vectors with a lot of additional info.

 Regards,
 Jayendra

 On Tue, Aug 30, 2011 at 3:34 AM, Gabriele Kahlout
 gabri...@mysimpatico.com wrote:
  Hello,
 
  This time I'm trying to duplicate Luke's functionality of knowing which
  terms occur in a search result/document (w/o parsing it again). Any Solrj
  API to do that?
 
  P.S. I've also posted the question on
  SOhttp://stackoverflow.com/q/7219111/300248
  .
 
  On Wed, Jul 6, 2011 at 11:09 AM, Gabriele Kahlout
  gabri...@mysimpatico.comwrote:
 
  From your patch I see TermFreqVector, which provides the information I
  want.
 
  I also found FieldInvertState.getLength(), which seems to be exactly what I
  want. I'm after the word count (the sum of tf for every term in the doc). I'm
  just not sure whether FieldInvertState.getLength() returns just the number
  of terms (not multiplied by the frequency of each term - the word count) or
  not, though. It seems as if it returns the word count, but I've not tested it
  sufficiently.
 
 
  On Wed, Jul 6, 2011 at 1:39 AM, Trey Grainger 
 the.apache.t...@gmail.comwrote:
 
  Gabriele,
 
  I created a patch that does this about a year ago.  See
  https://issues.apache.org/jira/browse/SOLR-1837.  It was written for
 Solr
  1.4 and is based upon the Document Reconstructor in Luke.  The patch
 adds
  a
  link to the main solr admin page to a docinspector page which will
  reconstruct the document given a uniqueid (required).  Keep in mind
 that
  you're only looking at what's in the index for non-stored fields, not
  the
  original text.
 
  If you have any issues using this on the most recent release, let me
 know
  and I'd be happy to create a new patch for solr 3.3.  One of these days
  I'll
  remove the JSP dependency and this may eventually making it into trunk.
 
  Thanks,
 
  -Trey Grainger
  Search Technology Development Team Lead, Careerbuilder.com
  Site Architect, Celiaccess.com
 
 
  On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout
  gabri...@mysimpatico.comwrote:
 
   Hello,
  
   With an inverted index the term is the key, and the documents are the
   values. Is it still however possible that given a document id I get
 the
   terms indexed for that document?
  
   --
   Regards,
   K. Gabriele
  
   --- unchanged since 20/9/10 ---
   P.S. If the subject contains [LON] or the addressee acknowledges
 the
   receipt within 48 hours then I don't resend the email.
   subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
   time(x)
Now + 48h) ⇒ ¬resend(I, this).
  
   If an email is sent by a sender that is not a trusted contact or the
  email
   does not contain a valid code then the email is not received. A valid
  code
   starts with a hyphen and ends with X.
   ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧
 y ∈
   L(-[a-z]+[0-9]X)).
  
 
 
 
 
  --
  Regards,
  K. Gabriele
 
  --- unchanged since 20/9/10 ---
  P.S. If the subject contains [LON] or the addressee acknowledges the
  receipt within 48 hours then I don't resend the email.
  subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
  time(x)  Now + 48h) ⇒ ¬resend(I, this).
 
  If an email is sent by a sender that is not a trusted contact or the
 email
  does not contain a valid code then the email is not received. A valid
 code
  starts with a hyphen and ends with X.
  ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
  L(-[a-z]+[0-9]X)).
 
 
 
 
  --
  Regards,
  K. Gabriele
 
  --- unchanged since 20/9/10 ---
  P.S. If the subject contains [LON] or the addressee acknowledges the
  receipt within 48 hours then I don't resend the email.
  subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
 time(x)
   Now + 48h) ⇒ ¬resend(I, this).
 
  If an email is sent by a sender that is not a trusted contact or the
 email
  does not contain a valid code then the email is not received. A valid
 code
  starts with a hyphen and ends with X.
  ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
  L(-[a-z]+[0-9]X)).
 




-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
 Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. 

shareSchema=true - location of schema.xml?

2011-08-30 Thread Satish Talim
I have thousands of cores, and to reduce the cost of loading/unloading
schema.xml I have set up my solr.xml as mentioned here -
http://wiki.apache.org/solr/CoreAdmin
- namely:

<solr>
  <cores adminPath="/admin/cores" shareSchema="true">
    ...
  </cores>
</solr>

However, I am not sure where to keep the common schema.xml file. Or do I still
need a schema.xml in the conf folder of each and every core?

My folder structure is:

multicore (contains solr.xml)
|_ core0
|  |_ conf
|     |_ schema.xml
|     |_ solrconfig.xml
|     |_ other files
|_ core1
|  |_ conf
|     |_ schema.xml
|     |_ solrconfig.xml
|     |_ other files
|_ exampledocs (contains 1000's of .csv files and post.jar)

Satish
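
For what it's worth, my (possibly wrong) reading of shareSchema is that it only
lets Solr cache and reuse the parsed schema for cores whose schema.xml resolves
to the same physical file; it does not remove the need for each core to be able
to find a schema.xml. A sketch of one layout that keeps a single physical
schema.xml, purely as an illustration (the core names, the shared instanceDir
and the per-core dataDir values are assumptions about your setup):

<solr>
  <cores adminPath="/admin/cores" shareSchema="true">
    <!-- both cores read the same conf/schema.xml under the shared instanceDir;
         with shareSchema="true" the parsed schema is cached and reused
         instead of being re-parsed for every core -->
    <core name="core0" instanceDir="shared" dataDir="/path/to/core0/data"/>
    <core name="core1" instanceDir="shared" dataDir="/path/to/core1/data"/>
  </cores>
</solr>

With separate instanceDirs (as in your layout) each core keeps its own copy of
schema.xml under conf/, and shareSchema only helps if those copies resolve to
the same file.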