Re: DataimportHandler development issue

2011-01-14 Thread Gora Mohanty
On Fri, Jan 14, 2011 at 12:17 AM, Derek Werthmuller
dwert...@ctg.albany.edu wrote:

 It's not clear why it's not working.  Advice?
 Also, is this the best way to load data?  We intend to load several
 thousand DocBook documents once we understand how this all works.  We stuck
 with the rss/atom example since we didn't want to deal with schema changes
 yet.
 Thanks
        Derek

 example-DIH/solr/rss/conf/rss-data-config.xml  modified source:

 <dataConfig>
 <dataSource type="URLDataSource" />
 <document>
 <entity name="slashdot"
         pk="link"
         url="http://twitter.com/statuses/user_timeline/existdb.rss"
         processor="XPathEntityProcessor"
         forEach="/rss/channel | /rss/channel/item"
         transformer="DateFormatTransformer">

 <field column="source" xpath="/rss/channel/title" commonField="true" />
 <field column="source-link" xpath="/rss/channel/link" commonField="true" />
 <field column="subject" xpath="/rss/channel/subject" commonField="true" />

 <field column="title" xpath="/rss/channel/item/title" />
 <field column="link" xpath="/rss/channel/item/link" />
 <field column="description" xpath="/rss/channel/item/description" />
 <field column="creator" xpath="/rss/channel/item/creator" />
 <field column="item-subject" xpath="/rss/channel/item/subject" />
 <field column="date" xpath="/rss/channel/item/date"
        dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
 <field column="slash-department" xpath="/rss/channel/item/department" />
 <field column="slash-section" xpath="/rss/channel/item/section" />
 <field column="slash-comments" xpath="/rss/channel/item/comments" />
 </entity>

 <entity name="twitter"
         pk="link"
         url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
         processor="XPathEntityProcessor"
         forEach="/feed | /feed/entry"
         transformer="DateFormatTransformer">

 <field column="source" xpath="/feed/title" commonField="true" />
 <field column="source-link" xpath="/feed/link" commonField="true" />
 <field column="subject" xpath="/feed/subtitle" commonField="true" />

 <field column="title" xpath="/feed/entry/title" />
 <field column="link" xpath="/feed/entry/link" />
 <field column="description" xpath="/feed/entry/description" />
 <field column="creator" xpath="/feed/entry/creator" />
 <field column="item-subject" xpath="/feed/entry/subject" />
 <field column="date" xpath="/rss/channel/item/date"
        dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
 <field column="slash-department" xpath="/feed/entry/department" />
 <field column="slash-section" xpath="/feed/entry/section" />
 <field column="slash-comments" xpath="/feed/entry/comments" />
 </entity>
 </document>
 </dataConfig>

Your problem is the second entity in the DIH configuration file. The
Solr schema defines the unique key to be the field link. As noted in
the comments in schema.xml, this means that this field is required.
Solr is not able to populate the link field from the Atom feed. I have
not tracked down why this is so, but it is probably because there is
more than one link node under /feed/entry, and the link field is not
multi-valued. Change the xpath to, say, /feed/entry/id, and the
import works. Also, while this is not necessarily an issue, please
note that several other fields have incorrect xpaths for this entity.

To answer your other question, this way of importing data should
work fine.

Regards,
Gora
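
For reference, a corrected second entity along the lines Gora suggests might
look like the sketch below. The Atom paths shown (e.g. /feed/entry/published)
are illustrative only; the remaining /feed/entry xpaths would need the same
review:

 <entity name="twitter"
         pk="link"
         url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
         processor="XPathEntityProcessor"
         forEach="/feed | /feed/entry"
         transformer="DateFormatTransformer">
   <!-- the single-valued entry id, usable as the unique key -->
   <field column="link" xpath="/feed/entry/id" />
   <field column="title" xpath="/feed/entry/title" />
   <field column="date" xpath="/feed/entry/published"
          dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
 </entity>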


Re: Improving Solr performance

2011-01-14 Thread supersoft

The tests are performed with a self-made program. The arguments are the number
of threads and the path to a file which contains the available queries (in the
last test only one). When each thread is created, it records the current time
(in milliseconds), and when it gets the response from the query, the thread logs
the difference from that initial time.
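
In outline, the harness looks roughly like this (a sketch; a plain JDK HTTP
fetch with one fixed placeholder query instead of the query file):

  import java.io.InputStream;
  import java.net.URL;

  public class QueryTimer {
      public static void main(String[] args) throws Exception {
          int threads = Integer.parseInt(args[0]);
          final String query = "http://localhost:8983/solr/select?q=apple"; // placeholder
          for (int i = 0; i < threads; i++) {
              final long created = System.currentTimeMillis(); // thread creation time
              new Thread(new Runnable() {
                  public void run() {
                      try {
                          InputStream in = new URL(query).openStream();
                          byte[] buf = new byte[8192];
                          while (in.read(buf) != -1) { /* drain the response */ }
                          in.close();
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                      // log elapsed ms from creation to full response
                      System.out.println(System.currentTimeMillis() - created);
                  }
              }).start();
          }
      }
  }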

In the last post, I wrote the results of the 100-thread example ordered
by the response date. The results ordered by the creation date are:

100 simultaneous queries: 9265, 11922, 12375, 4109, 4890, 7093, 21875, 8547,
13562, 13219, 1531, 11875, 21281, 31985, 11703, 7391, 32031, 22172, 21469,
13875, 1969, 11406, 8172, 9609, 16953, 13828, 17282, 22141, 16625, 2203,
24985, 2375, 25188, 2891, 5047, 6422, 20860, 7594, 23125, 32281, 32016,
5312, 23125, 11484, 10344, 11500, 18172, 3937, 11547, 13500, 28297, 20594,
24641, 7063, 24797, 12922, 1297, 8984, 20625, 13407, 23203, 32016, 15922,
21875, 8750, 12875, 23203, 26453, 26016, 11797, 31782, 24672, 21625, 7672,
18985, 14672, 22157, 26485, 23328, 9907, 5563, 24625, 14078, 4703, 25844,
12328, 11484, 6437, 25937, 26437, 18484, 13719, 16328, 28687, 23141, 14016,
26437, 13187, 25031, 31969


Re: Solr 4.0 = Spatial Search - How to

2011-01-14 Thread Stefan Matheis
caman,

how did you try to concat them? perhaps some typecasting would do the trick?

Stefan

On Fri, Jan 14, 2011 at 7:20 AM, caman aboxfortheotherst...@gmail.com wrote:


 Thanks
 Here was the issue: concatenating 2 floats (lat, lng) on the MySQL end converted
 the result to a BLOB, and indexing would fail when storing a BLOB in a 'location' type field.
 After the BLOB issue was resolved, all worked OK.

 Thank you all for your help






Re: Dismax, Sharding and Elevation

2011-01-14 Thread Oliver Marahrens
Hi,

thank you for your reply, Grijesh. But elevation in general does work with
sharding - if I use the Standard Request Handler instead of Dismax. I
just wonder how (or if) it could also work with dismax. I think it's not
a problem of distributed search, but one of dismax (perhaps combined
with distributed search).

Oliver

Grijesh.singh wrote:
 As I have seen in the code for QueryElevationComponent, there is no support for
 distributed search, i.e. query elevation does not work with shards.

 -
 Grijesh
   


-- 
Oliver Marahrens
TU Hamburg-Harburg / Universitätsbibliothek / Digitale Dienste
Denickestr. 22
21071 Hamburg - Harburg
Tel.+49 (0)40 / 428 78 - 32 91
eMail   o.marahr...@tu-harburg.de
--
GPG/PGP-Schlüssel: 
http://www.tub.tu-harburg.de/keys/Oliver_Marahrens_pub.asc
--
Projekt DISCUS http://discus.tu-harburg.de
Projekt TUBdok http://doku.b.tu-harburg.de



Solr and Ping PHP

2011-01-14 Thread stockii

Hello.

I am using NRT, and for each search request, update request and
commit request (on the search instance) I start a ping to Solr with an
HTTP request.


But sometimes the ping is not OK even though Solr is available. Why can't
Solr answer a ping while it is doing something like a commit on my searcher,
or while a search request is running?

Every night I get at least one error message, and that really sucks ...



-- System -

One server, 12 GB RAM, 2 Solr instances, 7 cores,
1 core with 31 million documents, the others under 100,000

- Solr1 for search requests - commit every minute   - 4GB Xmx
- Solr2 for update requests - delta every 2 minutes - 4GB Xmx


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-and-Ping-PHP-tp2254214p2254214.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Searchers and Warmups

2011-01-14 Thread Savvas-Andreas Moysidis
Hi David,

maybe the wiki page on caching could be helpful:
http://wiki.apache.org/solr/SolrCaching#newSearcher_and_firstSearcher_Event_Listeners

Regards,
- Savvas

On 14 January 2011 00:08, David Cramer dcra...@gmail.com wrote:

 I'm trying to understand the mechanics behind warming up, when new
 searchers are registered, and their costs. A quick Google didn't point me in
 the right direction, so hoping for some of that here.


 --
 David Cramer





Re: Searchers and Warmups

2011-01-14 Thread Tommaso Teofili
Hi David,
The idea is that you can define listeners which run a list of
queries against an IndexSearcher.
In particular, the firstSearcher event relates to the very first
IndexSearcher created inside the Solr instance, while the newSearcher
event relates to the creation of each new IndexSearcher (i.e. when a
commit is done, the searchers no longer in use get closed and new ones are
created on the last commit point).
Warming up is simply the execution of particular queries against such
IndexSearchers in order to put some documents in the caches before any
user-entered query is executed, so that the searchers are warmed with the
proper documents (i.e. those matching the most frequent queries). Also, some
documents from the old caches get into the caches of the new searchers,
depending on the cache configuration [1].
I hope this clarifies things a little bit.
Cheers,
Tommaso

[1] :
http://wiki.apache.org/solr/SolrPerformanceFactors#Cache_autoWarm_Count_Considerations
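
For illustration, such warming listeners are configured in solrconfig.xml
roughly as follows (a sketch; the queries shown are placeholders to be
replaced with your own frequent queries):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">popular query</str><str name="sort">price asc</str></lst>
    </arr>
  </listener>
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">static warming query</str></lst>
    </arr>
  </listener>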

2011/1/14 David Cramer dcra...@gmail.com

 I'm trying to understand the mechanics behind warming up, when new
 searchers are registered, and their costs. A quick Google didn't point me in
 the right direction, so hoping for some of that here.


 --
 David Cramer





Re: Solr 4.0 = Spatial Search - How to

2011-01-14 Thread Stefan Matheis
absolutely no idea why it is a blob .. but the following one works as
expected:

CAST( CONCAT( lat, ',', lng ) AS CHAR )

HTH
Stefan
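
For context, in a DIH query that cast would be used along these lines (a
sketch; table and column names are assumed):

  SELECT id, CAST(CONCAT(lat, ',', lng) AS CHAR) AS location FROM places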

On Fri, Jan 14, 2011 at 9:31 AM, caman aboxfortheotherst...@gmail.com wrote:



 CONCAT(CAST(lat as CHAR),',',CAST(lng as CHAR))
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-4-0-Spatial-Search-How-to-tp2245592p2254151.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Schema design FAQs/questions

2011-01-14 Thread Matthias Pigulla
Dear Solr-users,

is there a compilation of FAQs particularly targeting schema design? I have
two questions that have probably been asked before:

- I have to map different kinds of documents into my schema. Some of these 
documents have one or multiple time/dates that might be relevant for querying 
or sorting. I feel that it would be best to keep dates with different semantics 
in different fields. Does it pose any problems when some of these fields are 
filled only for certain documents? Presumably, when such a field is used for
sorting, documents not providing a field value will end up at one end of the result
set, depending on sort order?

- Sometimes several documents belong together as parts of a bigger
concept. I could keep a reference to this concept along with every document in
the index. Now, would it be possible to perform a search where hits on documents
are grouped by these concepts? That is, I would like to get a result list that
contains *only one* entry per concept, but for each of these entries tells me
which document(s) contained the match?

Thanks a lot!
-mp.


solr speed issues..

2011-01-14 Thread saureen

I am working on an application that requires fetching results from Solr based
on a date parameter. Earlier I was using sharding to fetch the results, but
that was making things too slow, so instead of sharding I queried three
different cores with the same parameters and merged the results. Still,
things are slow.

For one call I generally get around 500 to 1000 docs from Solr, so basically
I am including the following parameters in the URL for the Solr call:

sort=created+desc
json.nl=map
wt=json
rows=1000
version=1.2
omitHeader=true
fl=title
start=0
q=apple
qt=standard
fq=created:[date1 TO date2]


It's taking a long time to get the results; any solution for the above problem
would be great.



Query : FAQ? Forum?

2011-01-14 Thread Cathy Hemsley
Hi,

I am trying to get Solr installed and working:  and have some queries:  is
there a FAQ or a Forum?  How do I search to see whether someone has already
asked my question and answered it?

Regards
Cathy




Re: Query : FAQ? Forum?

2011-01-14 Thread Stefan Matheis
What about http://search.lucidimagination.com/search/#/p:solr ? :)

On Fri, Jan 14, 2011 at 12:45 PM, Cathy Hemsley 
cathy.hems...@converteam.com wrote:

 Hi,

 I am trying to get Solr installed and working:  and have some queries:  is
 there a FAQ or a Forum?  How do I search to see whether someone has already
 asked my question and answered it?

 Regards
 Cathy





boilerpipe solr tika howto please

2011-01-14 Thread arnaud gaudinat

Hello,

I would like to use BoilerPipe (a very good program which cleans the
html content of surplus clutter).
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible
from Solr, am I right?


How can I activate BoilerPipe in Solr? Do I need to change
solrconfig.xml (with
org.apache.solr.handler.extraction.ExtractingRequestHandler)?


Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F in the Tika forum
(http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration) -
is it the right way?


Thanks in advance,

Arno.



Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Jörg Agatz
OK, now in the 4th test it works? OK... I don't know... it works. But now I
have another problem: I can't send content to the server.




When I send content to Solr I get:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 400 </title>
</head>
<body><h2>HTTP ERROR: 400</h2><pre>Document [null] missing required field:
id</pre>
<p>RequestURI=/solr/update/extract</p><p><i><small><a href=
"http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
<br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>

</body>
</html>


I do:
curl "http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" -F myfile=@test.txt

some ideas?


Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Michael McCandless
Right, but removing a segment out from under a live IW (when you run
CheckIndex with -fix) is deadly, because that other IW doesn't know
you've removed the segment, and will later commit a new segment infos
still referencing that segment.

The nature of this particular exception from CheckIndex is very
strange... I think it can only be a bug in Lucene, a bug in the JRE or
a hardware issue (bits are flipping somewhere).

I don't think an error in the IO system can cause this particular
exception (it would cause others), because the deleted docs are loaded
up front when SegmentReader is init'd...

This is why I'd really like to see if a given corrupt index always
hits precisely the same exception if you run CheckIndex more than
once.

Mike
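
For reference, CheckIndex is run from the command line roughly like this (a
sketch; the jar and index paths are assumptions):

  java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index
  java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index -fix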

On Thu, Jan 13, 2011 at 10:56 PM, Lance Norskog goks...@gmail.com wrote:
 1) CheckIndex is not supposed to change a corrupt segment, only remove it.
 2) Are you using local hard disks, or do run on a common SAN or remote
 file server? I have seen corruption errors on SANs, where existing
 files have random changes.

 On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 Generally it's not safe to run CheckIndex if a writer is also open on the 
 index.

 It's not safe because CheckIndex could hit FNFE's on opening files,
 or, if you use -fix, CheckIndex will change the index out from under
 your other IndexWriter (which will then cause other kinds of
 corruption).

 That said, I don't think the corruption that CheckIndex is detecting
 in your index would be caused by having a writer open on the index.
 Your first CheckIndex has a different deletes file (_phe_p3.del, with
 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
 44828 deleted docs), so it must somehow have to do with that change.

 One question: if you have a corrupt index, and run CheckIndex on it
 several times in a row, does it always fail in the same way?  (Ie the
 same term hits the below exception).

 Is there any way I could get a copy of one of your corrupt cases?  I
 can then dig...

 Mike

 On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
 stephane.delp...@blogspirit.com wrote:
 I understand less and less what is happening to my solr.

 I did a checkIndex (without -fix) and there was an error...

 So I did another checkIndex with -fix and then the error was gone. The
 segment was alright.


 During checkIndex I do not shut down the Solr server, I just make sure no
 client connects to the server.

 Should I shut down the solr server during checkIndex ?



 first checkIndex :

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.OK [44824 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs
 seen 0 + num docs deleted 0]
 java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 +
 num docs deleted 0
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [7206878 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)


 a few minutes latter :

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.OK [44828 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs;
 28919124 tokens]
    test: stored fields...OK [7206764 total field count; avg 32.86 fields
 per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]


 Le 12/01/2011 16:50, Michael McCandless a écrit :

 Curious... is it always a docFreq=1 != num docs seen 0 + 

Solr: using to index large folders recursively containing lots of different documents, and querying over the web

2011-01-14 Thread Cathy Hemsley
Hi Solr users,

I hope you can help.  We are migrating our intranet web site management
system to Windows 2008 and need a replacement for Index Server to do the
text searching.  I am trying to establish if Lucene and Solr is a feasible
replacement, but I cannot find the answers to these questions:

1. Can Solr be set up to recursively index a folder containing an
indeterminate and variable large number of subfolders, containing files of
all types:  XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations,
text files etc.  If so, how?
2. Can Solr be queried over the web and return a list of files that match a
search query entered by a user, and also return the abstracts for these
files, as well as 'hit highlighting'.  If so, how?
3. Can Solr be run as a service (like Index Server) that automatically
detects changes to the files within the indexed folder and updates the
index? If so, how?

Thanks for your help

Cathy Hemsley



Re: Problem with Tika and ExtractingRequestHandler (How to from lucidimagination)

2011-01-14 Thread Stefan Matheis
Pass a value for your id field, as you already do for all the other
fields?

http://search.lucidimagination.com/search/document/ca95d06e700322ed/missing_required_field_id_using_extractingrequesthandler
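
For example, with the ExtractingRequestHandler the unique key can be supplied
as a literal.<field> parameter, roughly like this (a sketch; host, id value
and file name are placeholders):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F myfile=@test.txt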

On Fri, Jan 14, 2011 at 12:59 PM, Jörg Agatz joerg.ag...@googlemail.com wrote:

 OK, now in the 4th test it works? OK... I don't know... it works. But now I
 have another problem: I can't send content to the server.




 When I send content to Solr I get:

 <html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
 <title>Error 400 </title>
 </head>
 <body><h2>HTTP ERROR: 400</h2><pre>Document [null] missing required field:
 id</pre>
 <p>RequestURI=/solr/update/extract</p><p><i><small><a href=
 "http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p><br/>

 <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>
 <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/>

 </body>
 </html>


 I do:
 curl "http://192.168.105.66:8983/solr/update/extract?ext.idx.attr=true&ext.def.fl=text" -F myfile=@test.txt

 some ideas?



Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web

2011-01-14 Thread Markus Jelsma
Please visit the Nutch project. It is a powerful crawler and can integrate 
with Solr.

http://nutch.apache.org/

 Hi Solr users,
 
 I hope you can help.  We are migrating our intranet web site management
 system to Windows 2008 and need a replacement for Index Server to do the
 text searching.  I am trying to establish if Lucene and Solr is a feasible
 replacement, but I cannot find the answers to these questions:
 
 1. Can Solr be set up to recursively index a folder containing an
 indeterminate and variable large number of subfolders, containing files of
 all types:  XML, HTML, PDF, DOC, spreadsheets, powerpoint presentations,
 text files etc.  If so, how?
 2. Can Solr be queried over the web and return a list of files that match a
 search query entered by a user, and also return the abstracts for these
 files, as well as 'hit highlighting'.  If so, how?
 3. Can Solr be run as a service (like Index Server) that automatically
 detects changes to the files within the indexed folder and updates the
 index? If so, how?
 
 Thanks for your help
 
 Cathy Hemsley


Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web

2011-01-14 Thread Toke Eskildsen
On Fri, 2011-01-14 at 13:05 +0100, Cathy Hemsley wrote:
 I hope you can help.  We are migrating our intranet web site management
 system to Windows 2008 and need a replacement for Index Server to do the
 text searching.  I am trying to establish if Lucene and Solr is a feasible
 replacement, but I cannot find the answers to these questions:

The answers to your questions are yes and no to all of them. Solr does
not do what you ask out of the box, but it can certainly be done by
extending Solr or using it as the core of another system.

Some time ago I stumbled upon http://www.constellio.com/ which seems to
be exactly what you're looking for.



Re: Adding a new site to existing solr configuration

2011-01-14 Thread PeterKerk

Awesome! thx! :)


Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web

2011-01-14 Thread Markus Jelsma
Nutch can crawl the file system as well. Nutch 1.x can also provide search but 
this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch 
can provide Solr with content from your intranet.

On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote:
 Hi,
 Thanks for suggesting this.
  However, I'm not sure a 'crawler' will work: as the various pages are not
  necessarily linked (it's complicated: basically our intranet is a dynamic
  and managed collection of independently published web sites, and users
  find information using categorisation and/or text searching), so we need
 something that will index all the files in a given folder, rather than
 follow links like a crawler. Can Nutch do this? As well as the other
 requirements below?
 Regards
 Cathy
 
 On 14 January 2011 12:09, Markus Jelsma markus.jel...@openindex.io wrote:
  Please visit the Nutch project. It is a powerful crawler and can
  integrate with Solr.
  
  http://nutch.apache.org/
  
   Hi Solr users,
   
   I hope you can help.  We are migrating our intranet web site management
   system to Windows 2008 and need a replacement for Index Server to do
   the text searching.  I am trying to establish if Lucene and Solr is a
  
  feasible
  
   replacement, but I cannot find the answers to these questions:
   
   1. Can Solr be set up to recursively index a folder containing an
   indeterminate and variable large number of subfolders, containing files
  
  of
  
   all types:  XML, HTML, PDF, DOC, spreadsheets, powerpoint
   presentations, text files etc.  If so, how?
   2. Can Solr be queried over the web and return a list of files that
   match
  
  a
  
   search query entered by a user, and also return the abstracts for these
   files, as well as 'hit highlighting'.  If so, how?
   3. Can Solr be run as a service (like Index Server) that automatically
   detects changes to the files within the indexed folder and updates the
   index? If so, how?
   
   Thanks for your help
   
   Cathy Hemsley

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Is deduplication possible during Tika extract?

2011-01-14 Thread arnaud gaudinat

Hello,

here is an excerpt of my solrconfig.xml:

<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
    startup="lazy">

  <lst name="defaults">

    <str name="update.processor">dedupe</str>

    <!-- All the main content goes into "text"... if you need to return
       the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="dedupe">
  <processor
      class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">

    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str
        name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>

  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Deduplication works when I use only /update, but not when Solr does an
extract with Tika!

Is deduplication possible during Tika extract?

Thanks in advance,
Arno



Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Stéphane Delprat

So I ran checkIndex (without -fix) 5 times in a row :

SOLR was running, but no client connected to it. (just the slave which 
was synchronizing every 5 minutes)


summary :

1: all good
2: 2 errors: (seg 1 & 2) terms, freq, prox...ERROR [term blog_id:104150:
doc 324697 <= lastDoc 324697] & terms, freq, prox...ERROR [term
blog_keywords:SPORT: doc 174808 <= lastDoc 174808]

3: 1 error: (seg 2) terms, freq, prox...ERROR [Index: 105, Size: 51]
4: all good
5: 1 error: (seg 7) terms, freq, prox...ERROR [term blog_comments: %X 
docFreq=1 != num docs seen 0 + num docs deleted 0]


Seems to me that some random things are happening here.

File system is ext3, on a physical server.


Here are the logs of the interesting segments :

** 1 **

  1 of 17: name=_nqt docCount=431889
compound=false
hasProx=true
numFiles=9
size (MB)=1,671.375
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_nqt_1y2.del]
test: open reader.OK [41918 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs 
pairs; 59357374 tokens]
test: stored fields...OK [11505678 total field count; avg 
29.504 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


  2 of 17: name=_ol7 docCount=913886
compound=false
hasProx=true
numFiles=9
size (MB)=3,567.739
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_ol7_1mc.del]
test: open reader.OK [74076 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs 
pairs; 132337348 tokens]
test: stored fields...OK [26933113 total field count; avg 32.07 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


** 2 **

  1 of 17: name=_nqt docCount=431889
compound=false
hasProx=true
numFiles=9
size (MB)=1,671.375
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0

_20, java.vendor=Sun Microsystems Inc.}
has deletions [delFileName=_nqt_1y2.del]
test: open reader.OK [41918 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <=
lastDoc 324697]
java.lang.RuntimeException: term blog_id:104150: doc 324697 <= lastDoc
324697
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [11505678 total field count; avg 
29.504 fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]

FAILED
WARNING: fixIndex() would remove reference to this segment; full 
exception:

java.lang.RuntimeException: Term Index test failed
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

  2 of 17: name=_ol7 docCount=913886
compound=false
hasProx=true
numFiles=9
size (MB)=3,567.739
diagnostics = {optimize=false, mergeFactor=10, 
os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, 
lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, 
os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}

has deletions [delFileName=_ol7_1mc.del]
test: open reader.OK [74076 deleted docs]
test: fields..OK [51 fields]
test: field norms.OK [51 fields]
test: terms, freq, prox...ERROR [term blog_keywords:SPORT: doc
174808 <= lastDoc 174808]
java.lang.RuntimeException: term blog_keywords:SPORT: doc 174808 <=
lastDoc 174808
at 
org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
at 
org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)

at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
test: stored fields...OK [26933113 total field count; avg 32.07 
fields per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq 
vector fields per doc]


LukeRequestHandler histogram?

2011-01-14 Thread Bernd Fehling
Dear list,

what is the LukeRequestHandler histogram telling me?

Couldn't find any explanation and would be pleased to have it explained.

Many thanks in advance,
Bernd



Re: LukeRequestHandler histogram?

2011-01-14 Thread Stefan Matheis
Hi Bernd,

there is an explanation from Hoss:
http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b

HTH
Stefan

On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling 
bernd.fehl...@uni-bielefeld.de wrote:

 Dear list,

 what is the LukeRequestHandler histogram telling me?

 Couldn't find any explanation and would be pleased to have it explained.

 Many thanks in advance,
 Bernd




Re: LukeRequestHandler histogram?

2011-01-14 Thread Bernd Fehling
Hi Stefan,

thanks a lot.

Regards,
Bernd


On 14.01.2011 15:25, Stefan Matheis wrote:
 Hi Bernd,
 
 there is an explanation from Hoss:
 http://search.lucidimagination.com/search/document/149e7d25415c0a36/some_kind_of_crazy_histogram#b22563120f1ec32b
 
 HTH
 Stefan
 
 On Fri, Jan 14, 2011 at 3:15 PM, Bernd Fehling 
 bernd.fehl...@uni-bielefeld.de wrote:
 
 Dear list,

 what is the LukeRequestHandler histogram telling me?

 Couldn't find any explanation and would be pleased to have it explained.

 Many thanks in advance,
 Bernd


 


Re: Query : FAQ? Forum?

2011-01-14 Thread kenf_nc

http://wiki.apache.org/solr/FrontPage Solr Wiki 
http://wiki.apache.org/solr/FAQ Solr FAQ 
http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881/ref=sr_1_1?ie=UTF8&qid=1295018231&sr=8-1
A good book on Solr 

And this forum you posted to 
http://lucene.472066.n3.nabble.com/Solr-User-f472068.html (Solr-User)  is
one of the most active and useful Tech forums I've ever used. Don't be
afraid to ask stupid questions, folks here are pretty forgiving and patient,
especially if you attempt to use the Wiki or FAQ first.

Good Luck!
Ken


Re: boilerpipe solr tika howto please

2011-01-14 Thread Adam Estrada
Is there a drastic difference between this and TagSoup which is already
included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
arnaud.gaudi...@gmail.com wrote:

 Hello,

 I would like to use BoilerPipe (a very good program which cleans the html
 content from surplus clutter).
 I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from
 solr, am I right?

 How I can Activate BoilerPipe in Solr? Do I need to change solrconfig.xml (
 with org.apache.solr.handler.extraction.ExtractingRequestHandler)?

 Or do I need to modify some code inside Solr?

 I so something like TikaCLI -F in the tika forum (
 http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration)
 is it the right way?

 Thanks in advance,

 Arno.




Re: boilerpipe solr tika howto please

2011-01-14 Thread arnaud gaudinat
I just looked at TagSoup, and it seems to clean up bad HTML tags to create
good HTML.
What BoilerPipe does is try to eliminate HTML content which is not
part of the useful content for a human reader (i.e. navigation elements,
ads, comments...).
Take a look here: http://boilerpipe-web.appspot.com/ and try it with one of
your URLs.


Another application of this type is 'Readability', which is more for an
end user (http://lab.arc90.com/experiments/readability/).



On 14.01.2011 16:51, Adam Estrada wrote:

Is there a drastic difference between this and TagSoup which is already
included in Solr?

On Fri, Jan 14, 2011 at 6:57 AM, arnaud gaudinat
arnaud.gaudi...@gmail.com wrote:


Hello,

I would like to use BoilerPipe (a very good program which cleans the html
content from surplus clutter).
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible from
solr, am I right?

How can I activate BoilerPipe in Solr? Do I need to change solrconfig.xml (
with org.apache.solr.handler.extraction.ExtractingRequestHandler)?

Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F in the Tika forum (
http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration)
- is it the right way?

Thanks in advance,

Arno.






Re: Improving Solr performance

2011-01-14 Thread Gora Mohanty
On Fri, Jan 14, 2011 at 1:56 PM, supersoft elarab...@gmail.com wrote:

 The tests are performed with a selfmade program.
[...]

May I ask what language the program is written in? The reason for
asking is to eliminate the possibility that there is an issue with the
threading model, e.g. if you were using Python.

Would it be possible for you to run Apache bench, ab, against
your Solr setup, e.g., something like:

# For 10 simultaneous connections
ab -n 100 -c 10 http://localhost:8983/solr/select/?q=my_query1

# For 50 simultaneous connections
ab -n 500 -c 50 http://localhost:8983/solr/select/?q=my_query2

Please pay attention to the meaning of the -n parameter (there
is a slight gotcha there). man ab for details on usage, or see,
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
for example.

 In the last post, I wrote the results of the 100 threads example orderered
 by the response date. The results ordered by the creation date are:
[...]

OK, the numbers make more sense now.

As someone else has pointed out, your throughput does increase
with more simultaneous queries, and there are better ways to do
the measurement. Nevertheless, your results are very much at odds
with what we see, and I would like to understand the issue.

Regards,
Gora


Re: boilerpipe solr tika howto please

2011-01-14 Thread Ken Krugler

Hi Arno,

On Jan 14, 2011, at 3:57am, arnaud gaudinat wrote:


Hello,

I would like to use BoilerPipe (a very good program which cleans the  
html content from surplus clutter).
I saw that BoilerPipe is inside Tika 0.8 and so should be accessible  
from solr, am I right?


How can I activate BoilerPipe in Solr? Do I need to change
solrconfig.xml (with
org.apache.solr.handler.extraction.ExtractingRequestHandler)?


Or do I need to modify some code inside Solr?

I saw something like TikaCLI -F in the Tika forum (http://www.lucidimagination.com/search/document/242ce3a17f30f466/boilerpipe_integration)
- is it the right way?


You need to add the BoilerpipeContentHandler into Tika's content  
handler chain.


Which I'm pretty sure means you'd need to modify Solr, e.g. (in trunk)  
the TikaEntityProcessor.getHtmlHandler() method. I'd try something like:


return new BoilerpipeContentHandler(new ContentHandlerDecorator(

Though from a quick look at that code, I'm curious why it doesn't use  
BodyContentHandler, versus the current ContentHandlerDecorator.
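
As a rough standalone illustration of the same idea outside Solr (a sketch
only; the input file name is a placeholder, and the constructor wrapping a
plain ContentHandler is assumed to be available as in Tika 0.8):

  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.ParseContext;
  import org.apache.tika.parser.html.BoilerpipeContentHandler;
  import org.apache.tika.parser.html.HtmlParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class BoilerpipeDemo {
      public static void main(String[] args) throws Exception {
          BodyContentHandler text = new BodyContentHandler();
          InputStream in = new FileInputStream("page.html");
          try {
              // Boilerpipe filters navigation/ads before text reaches the handler
              new HtmlParser().parse(in, new BoilerpipeContentHandler(text),
                      new Metadata(), new ParseContext());
          } finally {
              in.close();
          }
          System.out.println(text.toString());
      }
  }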


-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: Variable datasources

2011-01-14 Thread tjpoe

I was actually able to figure this out using a slightly different method

since the databases exist on the same server I simply made a single
datasource with no database selected:

<dataSource url="jdbc:mysql://localhost/" name="content" />

then in the queries, I qualify using the full database notation:
database.table rather than just table

<document name="items">
  <entity dataSource="content" name="local" query="select code from
master.locals" rootEntity="false">
    <entity dataSource="content" name="item" query="select *, '${local.code}'
as code from content_${local.code}.item" />
  </entity>
</document>

it works as expected


No system property or default value specified for...

2011-01-14 Thread Tanner Postert
I'm trying to dynamically add a core to a multi core system using the
following command:

http://localhost:8983/solr/admin/cores?action=CREATE&name=items&instanceDir=items&config=data-config.xml&schema=schema.xml&dataDir=data&persist=true

the data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource"
      url="jdbc:mysql://localhost/"
      ...
      name="server"/>
  <document name="items">
    <entity dataSource="server" name="locals"
        query="select code from master.locals"
        rootEntity="false">
      <entity dataSource="server" name="item"
          query="select '${local.code}' as localcode,
                 items.*
                 FROM ${local.code}_meta.item
                 WHERE
                   item.lastmodified &gt; '${dataimporter.last_index_time}'
                 OR
                   '${dataimporter.request.clean}' != 'false'
                 order by item.objid"
          />
    </entity>
  </document>
</dataConfig>

this same configuration works for a core that is already imported into the
system, but when trying to add the core with the above command I get the
following error:

No system property or default value specified for local.code

so I added a <property/> tag in solr.xml, figuring that it needed some
type of default value for this to work, then restarted Solr, but now when
I try the import I get:

No system property or default value specified for
dataimporter.last_index_time

Do I have to define a default value for every variable I will conceivably
use for future cores? is there a way to bypass this error?
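
For reference, the property tag described above would look something like
this inside the core definition in solr.xml (a sketch; the value is a
placeholder):

  <core name="items" instanceDir="items">
    <property name="local.code" value="placeholder"/>
  </core>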

Thanks in advance


Re: segment gets corrupted (after background merge ?)

2011-01-14 Thread Michael McCandless
OK given that you're seeing non-deterministic results on the same
index... I think this is likely a hardware issue or a JRE bug?

If you move that index over to another env and run CheckIndex, is it consistent?

Mike

On Fri, Jan 14, 2011 at 9:00 AM, Stéphane Delprat
stephane.delp...@blogspirit.com wrote:
 So I ran checkIndex (without -fix) 5 times in a row :

 SOLR was running, but no client connected to it. (just the slave which was
 synchronizing every 5 minutes)

 summary :

 1: all good
 2: 2 errors: (seg 1 & 2) terms, freq, prox...ERROR [term blog_id:104150: doc
 324697 <= lastDoc 324697] & terms, freq, prox...ERROR [term
 blog_keywords:SPORT: doc 174808 <= lastDoc 174808]
 3: 1 error: (seg 2) terms, freq, prox...ERROR [Index: 105, Size: 51]
 4: all good
 5: 1 error: (seg 7) terms, freq, prox...ERROR [term blog_comments: %X
 docFreq=1 != num docs seen 0 + num docs deleted 0]

 Seems to me that some random things are happening here.

 File system is ext3, on a physical server.


 Here are the logs of the interesting segments :

 ** 1 **

  1 of 17: name=_nqt docCount=431889
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=1,671.375
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_nqt_1y2.del]
    test: open reader.OK [41918 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [5211271 terms; 39824029 terms/docs pairs;
 59357374 tokens]
    test: stored fields...OK [11505678 total field count; avg 29.504
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]

  2 of 17: name=_ol7 docCount=913886
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=3,567.739
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ol7_1mc.del]
    test: open reader.OK [74076 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...OK [9825896 terms; 93954470 terms/docs pairs;
 132337348 tokens]
    test: stored fields...OK [26933113 total field count; avg 32.07
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]

 ** 2 **

  1 of 17: name=_nqt docCount=431889
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=1,671.375
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0
 _20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_nqt_1y2.del]
    test: open reader.OK [41918 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term blog_id:104150: doc 324697 <=
 lastDoc 324697]
 java.lang.RuntimeException: term blog_id:104150: doc 324697 <= lastDoc
 324697
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields...OK [11505678 total field count; avg 29.504
 fields per doc]
    test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]
 FAILED
    WARNING: fixIndex() would remove reference to this segment; full
 exception:
 java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

  2 of 17: name=_ol7 docCount=913886
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=3,567.739
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
 os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
 java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ol7_1mc.del]
    test: open reader.OK [74076 deleted docs]
    test: fields..OK [51 fields]
    test: field norms.OK [51 fields]
    test: terms, freq, prox...ERROR [term blog_keywords:SPORT: doc 174808 <=
 lastDoc 174808]
 java.lang.RuntimeException: term blog_keywords:SPORT: doc 174808 <= lastDoc
 174808
        at
 org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:644)
        at 

DataImportHandler: full import of a single entity

2011-01-14 Thread Jon Drukman
I've got a DataImportHandler set up with 5 entities.  I would like to do a full
import on just one entity.  Is that possible?

I worked around it temporarily by hand editing the dataimport.properties file
and deleting the delta line for that one entity, and kicking off a delta.  But
for (hopefully) obvious reasons, delta is less efficient than full.

-jsd-



MaxRows and disabling sort

2011-01-14 Thread Salman Akram
Hi,

I want to limit my SOLR results so that it stops further searching once it
finds a certain number of records (just like 'limit' in MySQL).

I know it has the timeAllowed property, but is there anything like MaxRows? I am
NOT talking about the 'rows' attribute, which returns a specific no. of rows to
the client. This seems a very nice way to stop SOLR from traversing the
complete index, but I am not sure if there is anything like this.

Also, I guess default sorting is on scoring, and sorting can only be done once
it has the scores of all matches, so limiting it to the max rows becomes
useless. So is there a way to disable sorting? E.g. it returns the rows as
it finds them, without any order?

Thanks!


-- 
Regards,

Salman Akram
Cell: +92-321-4391210


Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-14 Thread Erick Erickson
This might work:

Define your field to use WhitespaceTokenizer and LowerCaseFilterFactory

Use a filter query referencing this field.

If you wanted the words to appear in their exact order, you could just
define
the pf field in your dismax.

Best
Erick
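
For illustration, such a field type could look like this in schema.xml (a
sketch; the type name is arbitrary):

  <fieldType name="text_exact_ci" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>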

On Thu, Jan 13, 2011 at 8:01 PM, Estrada Groups 
estrada.adam.gro...@gmail.com wrote:

 Ahhh...the fun of open source software ;-). Requires a ton of trial and
 error! I found what worked for me and figured it was worth passing it along.
 If you don't mind...when you sort everything out on your end, please post
 results for the rest of us to take a gander at.

 Cheers,
 Adam

 On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com
 wrote:

  Thanks for your reply. However, it doesn't work for my case at all. I think
  it's a problem with the query parser or something else. It forces me to put
  double quotes around the search query in order to get any results.
 
  <str name="rawquerystring">"sim 010"</str>
  <str name="querystring">"sim 010"</str>
  <str name="parsedquery">+DisjunctionMaxQuery((keyphrase:"sim 010")) ()</str>
  <str name="parsedquery_toString">+(keyphrase:"sim 010") ()</str>

  <str name="rawquerystring">smart mobile</str>
  <str name="querystring">smart mobile</str>
  <str name="parsedquery">
  +((DisjunctionMaxQuery((keyphrase:smart))
  DisjunctionMaxQuery((keyphrase:mobile)))~2) ()
  </str>
  <str name="parsedquery_toString">+(((keyphrase:smart)
  (keyphrase:mobile))~2) ()</str>
 
  The intent here is to do a full text search, part of which is to search the
  keyword field, so I can't put quotes around it.
 
  On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada 
  estrada.adam.gro...@gmail.com wrote:
 
  Hi,
 
  the following seems to work pretty well.
 
    <fieldType name="text_ws" class="solr.TextField"
        positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.ShingleFilterFactory"
            maxShingleSize="4" outputUnigrams="true"
            outputUnigramIfNoNgram="false" />
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and
      matching of words on case-change, alpha numeric boundaries, and
      non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
      match a document containing "Wi-Fi".
      Synonyms and stopwords are customized by external files, and
      stemming is enabled.
      The attribute autoGeneratePhraseQueries="true" (the default) causes
      words that get split to form phrase queries. For example,
      WordDelimiterFilter splitting text:pdp-11 will cause the parser
      to generate text:"pdp 11" rather than (text:PDP OR text:11).
      NOTE: autoGeneratePhraseQueries="true" tends to not work well for
      non whitespace delimited languages.
    -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
        autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <copyField source="cat" dest="text"/>
    <copyField source="subject" dest="text"/>
    <copyField source="summary" dest="text"/>
    <copyField source="cause" dest="text"/>
    <copyField source="status" dest="text"/>
    <copyField source="urgency" dest="text"/>
 
  I ingest the source fields as text_ws (I know I've changed it a bit) and
  then copy the field to text. This seems to do what you are asking for.
 
  

Re: solr speed issues..

2011-01-14 Thread Erick Erickson
You haven't given us much information here, it might help to review:
http://wiki.apache.org/solr/UsingMailingLists

In addition to Kenf_nc's  comments, your sorting may be
an issue, especially if you're measuring the first query
times.

What does debugQuery=on show? How many docs in your index?
How much RAM are you allocating to the JVM? Have you looked
at your cache statistics on the admin page?

Best
Erick

On Fri, Jan 14, 2011 at 3:26 AM, saureen saureen_ad...@yahoo.co.in wrote:


 I am working on an application that requires fetching results from solr
 based
 on date parameter..earlier i was using sharding to fetch the results but
 that was making things too slow,so instead of sharding,i queried on three
 different cores with the same parameters and merged the results..still the
 things are slow..

 for one call i generally get around 500 to 1000 docs from solr..so
 basically
 i am including following parameters in url for solr call

 sort=created+desc
 json.nl=map
 wt=json
 rows=1000
 version=1.2
 omitHeader=true
 fl=title
 start=0
 q=apple
 qt=standard
 fq=created:[date1 TO date2]


 Its taking long time to get the results,any solution for the above problem
 would be great..




Re: MaxRows and disabling sort

2011-01-14 Thread Erick Erickson
Why do you want to do this? That is, what problem do you think would
be solved by this? Because there are other problems if you're trying to,
say, return all rows that match

But no, there's nothing that I know of that would do what you want (of
course that doesn't mean there isn't).

Best
Erick

On Fri, Jan 14, 2011 at 12:17 PM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 Hi,

 I want to limit my SOLR results so that it stops further searching once it
 finds a certain number of records (just like 'limit' in MySQL).

 I know it has a timeAllowed property, but is there anything like MaxRows? I
 am NOT talking about the 'rows' attribute, which returns a specific number
 of rows to the client. MaxRows seems a very nice way to stop SOLR from
 traversing the complete index, but I am not sure whether anything like it
 exists.

 Also, I guess the default sort is by score, and sorting can only be done
 once it has the scores of all matches, so limiting it to the max rows
 becomes useless. So is there a way to disable sorting, e.g. so that it
 returns the rows as it finds them, without any order?

 Thanks!


 --
 Regards,

 Salman Akram
 Cell: +92-321-4391210



Re: MaxRows and disabling sort

2011-01-14 Thread Salman Akram
In some cases my search takes too long, and I want to show the user partial
matches when that happens.

The problem with timeAllowed is that, let's say I set its value to 10 secs:
for some queries that would be fine and it will return at least a few
hundred rows, but in really bad scenarios it might not return even a few
records in that time (even 0 is quite possible), so the user would think
nothing matched even though there were many matches.

Telling SOLR to return the first 20/50 records would ensure that it at
least returns the user the first page, even if that takes more time.
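
As background for the trade-off Salman describes, timeAllowed is specified
in milliseconds on the request; a sketch (hostname and query are
assumptions):

http://localhost:8983/solr/select?q=some+query&rows=50&timeAllowed=10000

When the limit is hit, Solr returns whatever it has found so far and flags
the response as partial (a partialResults entry in the response header),
which is exactly why a hard cut-off of 10 seconds can yield zero rows for
an expensive query.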

 On Sat, Jan 15, 2011 at 3:11 AM, Erick Erickson erickerick...@gmail.com wrote:

 Why do you want to do this? That is, what problem do you think would
 be solved by this? Because there are other problems if you're trying to,
  say, return all rows that match...

 But no, there's nothing that I know of that would do what you want (of
 course that doesn't mean there isn't).

 Best
 Erick

 On Fri, Jan 14, 2011 at 12:17 PM, Salman Akram 
 salman.ak...@northbaysolutions.net wrote:

  Hi,
 
  I want to limit my SOLR results so that it stops further searching once it
  finds a certain number of records (just like 'limit' in MySQL).

  I know it has a timeAllowed property, but is there anything like MaxRows?
  I am NOT talking about the 'rows' attribute, which returns a specific
  number of rows to the client. MaxRows seems a very nice way to stop SOLR
  from traversing the complete index, but I am not sure whether anything
  like it exists.

  Also, I guess the default sort is by score, and sorting can only be done
  once it has the scores of all matches, so limiting it to the max rows
  becomes useless. So is there a way to disable sorting, e.g. so that it
  returns the rows as it finds them, without any order?
 
  Thanks!
 
 
  --
  Regards,
 
  Salman Akram
  Cell: +92-321-4391210
 




-- 
Regards,

Salman Akram
Senior Software Engineer - Tech Lead
80-A, Abu Bakar Block, Garden Town, Pakistan
Cell: +92-321-4391210


Re: MaxRows and disabling sort

2011-01-14 Thread Chris Hostetter

: Also, I guess the default sort is by score, and sorting can only be done once
: it has the scores of all matches, so limiting it to the max rows becomes
: useless. So is there a way to disable sorting, e.g. so that it returns the
: rows as it finds them, without any order?

http://wiki.apache.org/solr/CommonQueryParameters#sort
You can sort by index id using sort=_docid_ asc or sort=_docid_ desc

if you specify _docid_ asc then solr should return as soon as it finds the 
first N matching results w/o scoring all docs (because no score will be 
computed)

if you use any complex features however (faceting or what not) then it 
will still most likely need to scan all docs.


-Hoss
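
Combining Hoss's _docid_ trick with a small rows value gives the
MySQL-style 'limit' behaviour asked about above; a sketch of such a request
(host, core, and query string are assumptions):

http://localhost:8983/solr/select?q=some+query&sort=_docid_+asc&rows=20

Because no scores are computed and no sort values compared, Solr can stop
collecting as soon as the first 20 matches in index order have been found,
per Hoss's explanation above.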


Re: Multi-word exact keyword case-insensitive search suggestions

2011-01-14 Thread Chamnap Chhorn
Ahh, thanks guys for helping me!

Adam's solution doesn't work for me. Here are my field, fieldType, and
Solr query:

<fieldType name="text_keyword" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="4" outputUnigrams="true"
            outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

<field name="keyphrase" type="text_keyword" indexed="true" stored="false"
       multiValued="true"/>

http://localhost:8081/solr/select?q=printing%20house&qf=keyphrase&debugQuery=on&defType=dismax

<str name="parsedquery">
+((DisjunctionMaxQuery((keyphrase:smart))
DisjunctionMaxQuery((keyphrase:mobile)))~2) ()
</str>
<str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2)
()</str>
<lst name="explain"/>

No results are found.

Erick's solution works for me. However, I can't use a filter query, since
this is part of a full-text search. If I used fq, it would return only
documents that match the query exactly; I want to show exact matches at the
top, followed by documents that match partially.

The problem is that when the user searches a single word (e.g. "printing"
from the keyword "printing house"), that document is also included in the
search results. The other problem is that if the user searches with the
words in reverse order (e.g. "house printing"), it is also found.
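
One reason the shingle approach above misfires is that the dismax parser
splits the query on whitespace before analysis, so each word reaches the
keyphrase analyzer separately and no multi-word shingle is ever formed at
query time (which is what the parsedquery output above shows). A commonly
suggested alternative for exact, case-insensitive whole-phrase matching is
to keep the whole value as a single token and just lowercase it; a sketch
(the type name keyword_lower is hypothetical):

<fieldType name="keyword_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Ranking exact matches above partial ones could then be left to dismax's pf
(phrase fields) parameter, e.g. pf=keyphrase^10 as Erick suggests below,
rather than to fq, which would drop the partial matches entirely.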

Cheers

On Sat, Jan 15, 2011 at 3:31 AM, Erick Erickson erickerick...@gmail.com wrote:

 This might work:

 Define your field to use WhitespaceTokenizer and LowerCaseFilterFactory

 Use a filter query referencing this field.

 If you wanted the words to appear in their exact order, you could just
 define
 the pf field in your dismax.

 Best
 Erick

 On Thu, Jan 13, 2011 at 8:01 PM, Estrada Groups 
 estrada.adam.gro...@gmail.com wrote:

  Ahhh...the fun of open source software ;-). It requires a ton of trial and
  error! I found what worked for me and figured it was worth passing along.
  If you don't mind, when you sort everything out on your end, please post
  the results for the rest of us to take a gander at.
 
  Cheers,
  Adam
 
  On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com
  wrote:
 
   Thanks for your reply. However, it doesn't work for my case at all. I
   think it's a problem with the query parser or something else. It forces
   me to put double quotes around the search query in order to get any
   results.
  
   <str name="rawquerystring">sim 010</str>
   <str name="querystring">sim 010</str>
   <str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
   <str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

   <str name="rawquerystring">smart mobile</str>
   <str name="querystring">smart mobile</str>
   <str name="parsedquery">
   +((DisjunctionMaxQuery((keyphrase:smart))
   DisjunctionMaxQuery((keyphrase:mobile)))~2) ()
   </str>
   <str name="parsedquery_toString">+(((keyphrase:smart)
   (keyphrase:mobile))~2) ()</str>
  
   The intent here is to do a full-text search; part of that is to search
   the keyword field, so I can't put quotes around the query.
  
   On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada 
   estrada.adam.gro...@gmail.com wrote:
  
   Hi,
  
   the following seems to work pretty well.
  
  <fieldType name="text_ws" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.ShingleFilterFactory"
              maxShingleSize="4" outputUnigrams="true"
              outputUnigramIfNoNgram="false"/>
    </analyzer>
  </fieldType>

  <!-- A text field that uses WordDelimiterFilter to enable splitting and
       matching of words on case-change, alpha numeric boundaries, and
       non-alphanumeric chars, so that a query of "wifi" or "wi fi" could
       match a document containing "Wi-Fi".
       Synonyms and stopwords are customized by external files, and
       stemming is enabled.
       The attribute autoGeneratePhraseQueries="true" (the default) causes
       words that get split to form phrase queries. For example,
       WordDelimiterFilter splitting text:pdp-11 will cause the parser to
       generate text:"pdp 11" rather than (text:PDP OR text:11).
       NOTE: autoGeneratePhraseQueries="true" tends to not work well for
       non whitespace delimited languages.
  -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="true">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- in this example, we will only use synonyms at query time
      <filter class="solr.SynonymFilterFactory"
              synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
      -->
      <!-- Case insensitive stop word removal.
           add enablePositionIncrements=true in both the index and query
           analyzers to leave a 'gap' for more accurate phrase queries.
      -->
      <filter class="solr.StopFilterFactory"
              ignoreCase="true"
              words="stopwords.txt"