Re: Problems to clustering on tomcat
Claudio, it sounds like the word "Cluster" there is adding confusion. ClusteringComponent has to do with search-results clustering. What you seem to be after is the creation of a Solr cluster. You'll find good pointers here: http://search-lucene.com/?q=master+slave&fc_project=Solr&fc_type=wiki

Perhaps this is the best place to start: http://wiki.apache.org/solr/SolrReplication

Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Claudio Devecchi cdevec...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, August 9, 2010 7:07:54 PM Subject: Problems to clustering on tomcat

Hi everybody, I need to do some tests in my Solr installation. Previously I configured my application on a single node, and now I need to run some tests on a cluster configuration. I followed the steps on http://wiki.apache.org/solr/ClusteringComponent and when I start up the example system everything is OK, but when I try to run it on Tomcat I receive the error below; does somebody have an idea?

SEVERE: Could not start SOLR. Check solr/home property org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.clustering.ClusteringComponent'

-- Claudio Devecchi flickr.com/cdevecchi
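For the master/slave setup the SolrReplication wiki page describes, the heart of it is a ReplicationHandler entry in solrconfig.xml on each node. The snippet below is an illustrative sketch, not a drop-in config: the host name and poll interval are placeholders to adapt.

```xml
<!-- On the master (solrconfig.xml): replicate after each commit and
     also ship config files to the slaves. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave: point masterUrl at the master's replication handler
     and poll on an interval (hh:mm:ss). -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```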
Re: solr query result not read the latest xml file
hi everyone, I do these steps every time a new xml file is created (for example cat_978.xml has just been created):
1. delete the index (<delete><query>AUC_CAT:978</query></delete>)
2. commit the new cat_978.xml (java -jar post.jar cat_978.xml)
3. restart the java process (stop, then java -jar start.jar)
If I don't do those steps then the query result shown in the browser still uses the old value (cat_978.xml - no changes at all) instead of reading the new cat_978.xml. What I want to ask: is there a way so I don't need to restart the java process, since it consumes too much resources and time?

You don't need to delete the old document. Solr replaces it automatically, assuming they have the same uniqueKey. Probably HTTP caching is causing you problems when testing with a browser. You can disable it in the solrconfig.xml file: <httpCaching never304="true"/>
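The reply above boils down to two plain HTTP posts -- no delete, no restart. This is a hedged sketch: the URL assumes the stock example install, and the commands are echoed rather than executed.

```shell
# Assumed update URL for the stock example install -- adjust host/port/core.
SOLR_UPDATE="http://localhost:8983/solr/update"
POST="curl $SOLR_UPDATE -H 'Content-Type: text/xml' --data-binary"

# Re-posting a doc whose uniqueKey already exists replaces the old doc;
# the commit then makes it visible to searches. Remove 'echo' to run.
echo "$POST @cat_978.xml"
echo "$POST '<commit/>'"
```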
Re: how to support implicit trailing wildcards
you could satisfy this by making 2 fields: 1. exactmatch 2. wildcardmatch. Use copyField in your schema to copy 1 -> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR

this would score exact matches above (solely) wildcard matches. Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

Hi Bastian, Sorry for not making it clear: I also want an exact match to have a higher score than a wildcard match. That means if searching for 'mount', documents with 'mount' should have a higher score than documents with 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query to be processed with an analyzer, while from http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F , "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer." The rationale is that if I search 'mounted', I also want documents with 'mount' to match. So it seems built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net

Wildcard search is already built in, just use: ?q=umoun* ?q=mounta*

-Ursprüngliche Nachricht- Von: yandong yao [mailto:yydz...@gmail.com] Gesendet: Montag, 9. August 2010 15:57 An: solr-user@lucene.apache.org Betreff: how to support implicit trailing wildcards

Hi everyone, how to support an 'implicit trailing wildcard *' using Solr? E.g., using Google to search 'umoun', 'umount' will be matched; search 'mounta', 'mountain' will be matched. From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically, b) it will match even where there is no relationship, e.g. 'mount' will match 'mountain' also.

2) Using two-pass searching: the first pass searches the term dictionary through TermsComponent using the given keyword, then uses the first matched term from the term dictionary to search again. E.g., when a user enters 'umoun', TermsComponent will match 'umount', then 'umount' is used to search. The disadvantages are: a) I need to parse the query string so that I can recognize meta keywords such as 'AND', 'OR', '+', '-' (this is more complex as I am using a PHP client), b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where/how to start.

Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
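In schema.xml, the two-field setup Geert-Jan describes might look roughly like this (field and type names here are invented for illustration, not from the thread):

```xml
<!-- Both fields analyzed with the same "text" type; queries hit
     exactmatch with the bare term and wildcardmatch with term*. -->
<field name="exactmatch"    type="text" indexed="true" stored="false"/>
<field name="wildcardmatch" type="text" indexed="true" stored="false"/>
<copyField source="exactmatch" dest="wildcardmatch"/>
```

A query would then look like q=exactmatch:mount+wildcardmatch:mount*&q.op=OR, so documents matching the exact term score above wildcard-only matches.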
Re: solr query result not read the latest xml file
I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion? Eben -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068647.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion?

Shouldn't it be never304="true"? You wrote never304="false". Additionally, can't you try with something else than a browser: curl, wget etc.
AW: solr query result not read the latest xml file
make sure you send a <commit/> after add/delete to make the changes visible.

-Ursprüngliche Nachricht- Von: e8en [mailto:e...@tokobagus.com] Gesendet: Dienstag, 10. August 2010 10:04 An: solr-user@lucene.apache.org Betreff: Re: solr query result not read the latest xml file

I already set in my solrconfig.xml as you told me: <httpCaching never304="false"/> and then I commit the xml and it's still not working, the query result still shows the old data :( do you have any suggestion? Eben -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068647.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Delta import where last_modified
Hi all. I have set up my data-config with a MySQL database. The problem I am having is that MySQL doesn't execute the deltaQuery: the "where last_modified" clause is not executed and throws an error, unknown column 'last_modified' in where clause. Shouldn't this be treated as part of the deltaQuery instead of as a column in the table? Am I missing any configuration? I highly appreciate any feedback about this. Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Delta-import-where-last-modified-tp1068743p1068743.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
yes I tried with both values, never304="true" and never304="false", and none of them makes it work. what are curl and wget? I use the Mozilla Firefox browser. I'm really a newbie in the programming world, especially Solr -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068751.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: AW: solr query result not read the latest xml file
hi Bastian, how do I send a <commit/>? is it by typing: java -jar post.jar cat_978.xml? if yes then I've already done that. any solution please? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068782.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
yes I try with both value, never304="true" and never304="false" and none of them make it works

It must be <httpCaching never304="true"/>, so let's forget about never304="false". But when you change something in solrconfig.xml you need to restart jetty/tomcat. java -jar post.jar *.xml does a <commit/> by default at the end.

what is curl and wget?

They are command-line tools.

I use mozilla firefox browser I'm really newbie in programming world especially solr

Maybe you can configure firefox to disable caches.
AW: AW: solr query result not read the latest xml file
you can check the admin panel to see if there are pending deletes/commits in the statistics section. older versions of post.jar don't auto-commit the changes, so if your xml doesn't contain a <commit/> you could just create a commit.xml containing only the following: <commit/> and send it via post.jar. you can also curl it or whatever you like:

curl http://hostname:port/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

-Ursprüngliche Nachricht- Von: e8en [mailto:e...@tokobagus.com] Gesendet: Dienstag, 10. August 2010 10:22 An: solr-user@lucene.apache.org Betreff: Re: AW: solr query result not read the latest xml file

hi Bastian, how do I send a <commit/>? is it by typing: java -jar post.jar cat_978.xml? if yes then I've already done that. any solution please? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1068782.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
finally I found out the cause of my problem. yes, you don't need to delete the index and restart tomcat just to get the query result updated, you just need to commit the xml files.

I made a custom url as per a requirement from my client:
default url -- http://localhost/solr/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on
my custom url -- http://localhost/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on

I made the custom url by copying solr.war and renaming it to search.war, so in the webapps folder there are two war files. this is the cause of my problem: when I use the default url there is no problem at all, but when I use my custom url I have to delete, commit, and restart tomcat to make the query result correct.

the question has now changed :) how do I make search.war behave exactly the same as solr.war? maybe when I start tomcat I should add some parameter so it will include/point to search.war, not solr.war anymore? when I removed solr.war so there is only one war file in the webapps folder, which is search.war, I can't do a commit; it said 'FATAL: Solr returned an error: Not Found'. it is because the app is looking for solr.war, not search.war -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1070189.html Sent from the Solr - User mailing list archive at Nabble.com.
delete Problem..
Hallo Users... I have a problem deleting some indexed items. I tried it with:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>EMAIL_HEADER_FROM:test.de</query></delete>"

but nothing happens. EMAIL_HEADER_FROM is a String and in the past it always worked, but now I can't delete it. I can delete a single mail when I try to delete only one, like this:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>4b829265.7010...@test.de.20100803133543</query></delete>"
Re: Process entire result set
Thanks Jonathan! We decided to create offline results and store them in a non-SQL store (HBase), so we can answer the requests by selecting one of the offline generated results. These offline results are generated every day. Thanks! Eloi

On Thu, Aug 5, 2010 at 8:59 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

Eloi Rocha wrote: Hi everybody, I would like to know if it makes sense to use Solr in the following scenario: - search for large amounts of data (like 1000, 1, 10 registers) - each register contains four or five fields (strings and integers) - every time the entire result set is requested (I can paginate the results). It would be much better to get all results at once [...]

Depends on what kind of searching you're doing. Are you doing searching that needs an indexer like Solr? Then Solr is a good tool for your job. Are you not, and you can do what you want just as easily in an RDBMS or non-SQL store like MongoDB? Then I wouldn't use Solr. Assuming you really do need Solr, I think this should work, but I would not store the actual stored fields in Solr; I'd store those fields in an external store (key-value store, RDBMS, whatever). You store only what you need to index in Solr, you do your search, you get IDs back. You ask for the entire result set back, why not. If you give Solr enough RAM, and set your cache settings appropriately (really big document and related caches), then I _think_ it should perform okay. One way to find out. What you'd get back is just IDs, then you'd look up each ID in your external store to get the actual fields you want to operate on. _May_ not be necessary, maybe you could do it with Solr stored fields, but making Solr do only exactly what you really need from it (an index) will maximize its ability to do what you need in available RAM. If you don't need Solr/Lucene indexing/faceting behavior, and you can do just fine with an RDBMS or non-SQL store, use that.
Jonathan -- Eloi Rocha Neto Melon Tech - http://melontech.com.br +55 83 8868-7025
Re: Indexing fieldvalues with dashes and spaces
Hi, Try solr.KeywordTokenizerFactory. However, in your case it looks as if you have certain requirements for searching that require tokenization. So you should leave the WhitespaceTokenizer as is and create a separate field specially for the faceting, with indexed="true", stored="false" and type="string". I often create a dynamic field for such, e.g. <dynamicField name="*_facet" .../> and then do a copyField. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 9. aug. 2010, at 09.54, PeterKerk wrote:

Hi Erick, Ok, it's more clear now. I indeed have the whitespace tokenizer:

<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
  </analyzer>
</fieldType>

What happens is that I have a field value 'Beach Sea', which is a theme for a location. Because of the whitespace tokenizer it gets split up into 2 facet values, "Beach",2 and "Sea",2 (see below). Of course those individual facet names are NOT correct facet names, because it should be "Beach Sea". But if I REMOVE the whitespace tokenizer, it throws an error that a fieldType should always have a tokenizer. Which tokenizer would I need in order to get the correct facet name?

(I've been checking this page btw: http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html)

facet_counts:{
  facet_queries:{},
  facet_fields:{
    themes:["Gemeentehuis",2, "Beach",2, "Sea",2],
    province:["gelderland",1, "utrecht",1, "zuidholland",1],
    services:["exclusiev",2, "fotoreportag",2, "hur",2, "liv",1, "muziek",1]},
  facet_dates:{}}

-- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1052554.html Sent from the Solr - User mailing list archive at Nabble.com.
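Jan's suggestion above could be sketched like this in schema.xml: keep the tokenized field for search and copy it into an untokenized string field used only for faceting. The `_facet` naming is just a convention; the field names are taken from the thread but the exact wiring is illustrative.

```xml
<!-- Untokenized copy for faceting; the original "themes" field keeps its
     WhitespaceTokenizer for searching. -->
<dynamicField name="*_facet" type="string" indexed="true" stored="false"/>
<copyField source="themes" dest="themes_facet"/>
```

Then facet with facet.field=themes_facet, so "Beach Sea" stays one facet value instead of splitting into "Beach" and "Sea".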
Re: DIH: Rows fetch OK, Total Documents Failed??
Do you have any required fields or a uniqueKey in your schema.xml? Do you provide values for all these fields? AFAIU you don't need the commonField attribute for the id and title fields. I don't think that's your problem but anyway...

On Sat, Jul 31, 2010 at 11:29 AM, scr...@asia.com wrote:

Hi, I'm a bit lost with this. I'm trying to import a new XML via DIH; all rows are fetched but no documents are indexed? I don't find any log or error? Any ideas? Here is the STATUS:

<str name="command">status</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
  <str name="Total Requests made to DataSource">1</str>
  <str name="Total Rows Fetched">7554</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2010-07-31 10:14:33</str>
  <str name="Total Documents Processed">0</str>
  <str name="Total Documents Failed">7554</str>
  <str name="Time taken ">0:0:4.720</str>
</lst>

My xml file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <title>Moniteur VG1930wm 19 LCD Viewsonic</title>
    <url>http://x.com/abc?a(12073231)p(2822679)prod(89042332277)ttid(5)url(http%3A%2F%2Fwww.ffdsssd.com%2Fproductinformation%2F%7E66297%7E%2Fproduct.htm%26sender%3D2003)</url>
    <content>Moniteur VG1930wm 19 LCD Viewsonic VG1930WM</content>
    <price>247.57</price>
    <category>Ecrans</category>
  </product>
  etc...
</products>

and my dataconfig:

<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <entity name="products" url="file:///home/john/Desktop/src.xml" processor="XPathEntityProcessor" forEach="/products/product" transformer="DateFormatTransformer">
      <field column="id" xpath="/products/product/url" commonField="true"/>
      <field column="title" xpath="/products/product/title" commonField="true"/>
      <field column="category" xpath="/products/product/category"/>
      <field column="content" xpath="/products/product/content"/>
      <field column="price" xpath="/products/product/price"/>
    </entity>
  </document>
</dataConfig>
Re: how to support implicit trailing wildcards
Hi, You don't need to duplicate the content into two fields to achieve this. Try this: q=mount OR mount*

The exact match will always get a higher score than the wildcard match because wildcard matches use constant score. Making this work for multi-term queries is a bit trickier, but something along these lines: q=(mount OR mount*) AND (everest OR everest*)

-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:

you could satisfy this by making 2 fields: 1. exactmatch 2. wildcardmatch. Use copyField in your schema to copy 1 -> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR

this would score exact matches above (solely) wildcard matches. Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

Hi Bastian, Sorry for not making it clear: I also want an exact match to have a higher score than a wildcard match. That means if searching for 'mount', documents with 'mount' should have a higher score than documents with 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query to be processed with an analyzer, while from http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F , "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer." The rationale is that if I search 'mounted', I also want documents with 'mount' to match. So it seems built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net

Wildcard search is already built in, just use: ?q=umoun* ?q=mounta*

-Ursprüngliche Nachricht- Von: yandong yao [mailto:yydz...@gmail.com] Gesendet: Montag, 9. August 2010 15:57 An: solr-user@lucene.apache.org Betreff: how to support implicit trailing wildcards

Hi everyone, how to support an 'implicit trailing wildcard *' using Solr? E.g., using Google to search 'umoun', 'umount' will be matched; search 'mounta', 'mountain' will be matched. From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, thus 'umount' will be indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically, b) it will match even where there is no relationship, e.g. 'mount' will match 'mountain' also.

2) Using two-pass searching: the first pass searches the term dictionary through TermsComponent using the given keyword, then uses the first matched term from the term dictionary to search again. E.g., when a user enters 'umoun', TermsComponent will match 'umount', then 'umount' is used to search. The disadvantages are: a) I need to parse the query string so that I can recognize meta keywords such as 'AND', 'OR', '+', '-' (this is more complex as I am using a PHP client), b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where/how to start.

Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
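Jan's multi-term variant can be generated mechanically. Here is a small hedged shell sketch (not from the thread) that expands each whitespace-separated user term into a "(term OR term*)" clause so exact matches outrank wildcard-only matches:

```shell
# Expand "mount everest" into "(mount OR mount*) AND (everest OR everest*)".
expand_query() {
  out=""
  for t in $1; do            # rely on default word splitting
    out="${out:+$out AND }($t OR $t*)"
  done
  printf '%s\n' "$out"
}

expand_query "mount everest"
# prints: (mount OR mount*) AND (everest OR everest*)
```

The expanded string would then go into the q parameter; remember to URL-encode it before sending.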
Re: Facet Fields - ID vs. Display Value
If your concern is performance, faceting integers versus faceting strings, I believe Lucene makes the difference negligible. Given that choice I'd go with the string. Now, if you need to keep an association between id and string, you may want to facet a combined field "id:string" (or with some other delimiter), then parse it on display. That way you can still use the id if you need to hit a database or some other external source. If you don't ever need to reference the ID, I wouldn't even put it in the index. -- View this message in context: http://lucene.472066.n3.nabble.com/Facet-Fields-ID-vs-Display-Value-tp1062754p1072067.html Sent from the Solr - User mailing list archive at Nabble.com.
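The "parse it on display" step for the combined-field idea above is a one-liner; here is a hedged sketch (the value and delimiter are invented for illustration) using plain POSIX parameter expansion:

```shell
# A combined facet value of the form "id:label"; split it on the first colon.
val="42:Beach Sea"
id=${val%%:*}      # everything before the first colon
label=${val#*:}    # everything after the first colon
echo "$id / $label"
# prints: 42 / Beach Sea
```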
Re: solr query result not read the latest xml file
Hi, Beware that post.jar is just an example tool to play with the default example index located at the /solr/ namespace. It is very limited and you should look elsewhere for a more production-ready and robust tool. However, it has the ability to specify a custom url. Please try: java -jar post.jar -help

SimplePostTool: version 1.2
This is a simple command line tool for POSTing raw XML to a Solr port. XML data can be read from files specified as commandline args; as raw commandline arg strings; or via STDIN.
Examples:
java -Ddata=files -jar post.jar *.xml
java -Ddata=args -jar post.jar '<delete><id>42</id></delete>'
java -Ddata=stdin -jar post.jar < hd.xml
Other options controlled by System Properties include the Solr URL to POST to, and whether a commit should be executed. These are the defaults for all System Properties:
-Ddata=files
-Durl=http://localhost:8983/solr/update
-Dcommit=yes

Thus for your index, try: java -Durl=http://localhost:80/search/update -jar post.jar myfile.xml

-- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 12.10, e8en wrote:

finally I found out the cause of my problem. yes, you don't need to delete the index and restart tomcat just to get the query result updated, you just need to commit the xml files. I made a custom url as per a requirement from my client: default url -- http://localhost/solr/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on my custom url -- http://localhost/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on I made the custom url by copying solr.war and renaming it to search.war, so in the webapps folder there are two war files. this is the cause of my problem: when I use the default url there is no problem at all, but when I use my custom url I have to delete, commit, and restart tomcat to make the query result correct. the question has now changed :) how do I make search.war behave exactly the same as solr.war? maybe when I start tomcat I should add some parameter so it will include/point to search.war, not solr.war anymore? when I removed solr.war so there is only one war file in the webapps folder, which is search.war, I can't do a commit; it said 'FATAL: Solr returned an error: Not Found'. it is because the app is looking for solr.war, not search.war -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1070189.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: hl.usePhraseHighlighter
Thanks so much for your help! It works. I really appreciate it.

-Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Monday, August 09, 2010 6:05 PM To: solr-user@lucene.apache.org Subject: RE: hl.usePhraseHighlighter

I used text type and found the following in schema.xml. I don't know which ones I should remove. ***

You should remove <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> from both index and query time.
Re: delete Problem..
Hi, Since EMAIL_HEADER_FROM is a String type, you need to specify the whole field value every time. Wildcards could also work, but you'll get a problem with leading wildcards. The solution would be to change the fieldType into a text type using e.g. StandardTokenizerFactory - if this does not break other functionality you need on that field. Then it would support searching on part of the field. You should make this a phrase search to avoid ambiguities. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 12.29, Jörg Agatz wrote:

Hallo Users... I have a problem deleting some indexed items. I tried it with:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>EMAIL_HEADER_FROM:test.de</query></delete>"

but nothing happens. EMAIL_HEADER_FROM is a String and in the past it always worked, but now I can't delete it. I can delete a single mail when I try to delete only one, like this:

java -Ddata=args -jar /home/service/solr/apache-solr-nightly/example/exampledocs/post.jar "<delete><query>4b829265.7010...@test.de.20100803133543</query></delete>"
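Jan's "phrase search" advice for a delete-by-query could look like the following hedged sketch; the update URL is an assumption for the stock install, and the commands are echoed rather than executed:

```shell
# For a string field, delete by the exact stored value, quoted as a phrase;
# follow with a commit so the deletion becomes visible.
DELETE_XML='<delete><query>EMAIL_HEADER_FROM:"test.de"</query></delete>'
echo "curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '$DELETE_XML'"
echo "curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'"
```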
Re: delete Problem..
I'd try 2 things. First do a query q=EMAIL_HEADER_FROM:test.de and make sure some documents are found. If nothing is found, there is nothing to delete. Second, how are you testing to see if the document is deleted? The physical data isn't removed from the index until you Optimize I believe. Is it possible your delete is working, but your method of verifying isn't telling you it's marked for deletion? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-Problem-tp1070347p1072581.html Sent from the Solr - User mailing list archive at Nabble.com.
Improve Query Time For Large Index
Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using Solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1] and from the luke handler I know that numTerms =~ 20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch
Re: Implementing lookups while importing data
We are currently doing this via a JOIN on the numeric field, between the main data table and the lookup table, but this dramatically slows down indexing.

I believe a SQL JOIN is the fastest and easiest way in your case (in comparison with a nested entity, even using CachedSqlEntityProcessor). You probably don't have proper indexes in your database - check the SQL query plan.
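As an illustration of doing the lookup in SQL rather than with a nested entity, a DIH entity along these lines issues one joined query per import. The table and column names here are invented; the point is the single JOIN in the entity's query attribute:

```xml
<!-- One entity, one joined SELECT -- no per-row nested entity queries. -->
<entity name="item"
        query="SELECT i.id, i.title, l.label AS category_name
               FROM items i JOIN lookup l ON l.code = i.category_code">
  <field column="category_name" name="category_name"/>
</entity>
```

If this is still slow, an index on the join column (lookup.code in this sketch) is the usual fix, which is what checking the query plan would reveal.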
PDF file
I have a lot of pdf files. I am trying to import the pdf files into Solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf:

curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F file=@mccm.pdf

Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
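Importing a whole directory can be done with a shell loop around the same extract handler. This is a hedged sketch: the URL and the literal.id parameter are assumptions to adapt to your install, and 'echo' prints each command instead of running it (remove it to actually POST):

```shell
# Print one curl command per PDF in a directory, using the file name as id.
post_pdfs() {
  dir=$1
  url='http://localhost:8983/solr/update/extract'
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    echo curl "$url?literal.id=$(basename "$f")&commit=false" -F "file=@$f"
  done
}
```

Usage would be post_pdfs /path/to/pdfs, followed by a single commit at the end instead of committing per file.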
RE: Improve Query Time For Large Index
Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists, so you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common-words list. (Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2) Tom Burton-West

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index

Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using Solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1] and from the luke handler I know that numTerms =~ 20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch
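For reference, a CommonGrams setup along the lines Tom describes might look roughly like this in schema.xml. This is a sketch with an invented type name and words file; the query-time side uses the companion CommonGramsQueryFilterFactory so that phrases containing very frequent terms are matched as single tokens:

```xml
<fieldType name="textCommonGrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "commonwords.txt" would list your most frequent terms -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```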
Re: DIH and multivariable fields problems
Have others successfully imported dynamic multivalued fields in a child entity using the DataImportHandler via the child entity returning multiple records through a RDBMS? Yes, it's working ok with static fields. I didn't even know that it's possible to use variables in field names ( dynamic names ) in DIH configuration. This use case is quite unusual. This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. I'm not an expert in DIH source code but it seems there's special processing of dynamic fields that prevents handling field type (and multivalued attribute). Specifically there's conditional jump (continue) over field type detection code in case of dynamic field name ( see DataImporter:initEntity ). I guess the reason of such behavior is that you can't determine field type based on dynamic field name (${variable}_s) at that time (configuration parsing). I'm wondering if it's possible to determine field types at runtime (when actual field title_s name is resolved). I encountered similar problem with implicit sql_column - solr_field mapping using SqlEntityProcessor, i.e. when you select some columns and do not explicitly list all these columns as fields entries in your configuration. In this case field type detection doesn't work either. I think that moving type detection process into runtime would solve that problem also. Am i missing something obvious that prevents us from doing field type detection at runtime? Alex On Tue, Aug 10, 2010 at 4:20 AM, harrysmith harrysmith...@gmail.com wrote: This is increasingly more looking like a bug. To recap, I am trying to use the DIH to import multivalued dynamic fields and using a variable to name that field. Upon further testing, the multivalued import works fine with a static/constant name, but only keeps the first record when naming the field dynamically. See below for relevant snips. 
From schema.xml:

<dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="true"/>

From data-config.xml:

<entity name="terms" query="select distinct CORE_DESC_TERM from metadata where item_id=${item.DIVID_PK}">
  <entity name="metadata" query="select * from metadata where item_id=${item.DIVID_PK} AND core_desc_term='${terms.CORE_DESC_TERM}'">
    <field name="metadata_record_s" column="TEXT_VALUE" />
  </entity>
</entity>

Produces the following; note that the 3 records that should be returned are correctly returned when the field name is a constant:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">9892962</str>
    <arr name="metadata_record_s">
      <str>record 1</str>
      <str>record 2</str>
      <str>record 3</str>
      <str>Polygraph Newsletter Title</str>
    </arr>
    <arr name="title">
      <str>Polygraph Newsletter Title</str>
    </arr>
  </doc>
</result>

=== Now, changing the field name to a variable; note only the first record is retained for the 'Relation_s' field, when there should be 3 records.

<field name="metadata_record_s" column="TEXT_VALUE" />

becomes

<field name="${terms.CORE_DESC_TERM}_s" column="TEXT_VALUE" />

and produces the following:

<result name="response" numFound="1" start="0">
  <doc>
    <arr name="Relation_s">
      <str>record 1</str>
    </arr>
    <arr name="Title_s">
      <str>Polygraph Newsletter Title</str>
    </arr>
    <str name="id">9892962</str>
    <arr name="title">
      <str>Polygraph Newsletter Title</str>
    </arr>
  </doc>
</result>

Only the first record is retained. There was also another post (which received no replies) in the archive that reported the same issue. The DIH debug logs do show 3 records correctly being returned, so somehow these are not getting added. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1065244.html Sent from the Solr - User mailing list archive at Nabble.com.
Need help with facets
Hi guys, I have a solr index whose documents have the following fields: FirstName, LastName, RecruitedDate. I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? Or will I have to create another field (let's say RecruitedDateFacet) and store the text in there? My problem is that if I use a separate field for faceting and store a string in it, then if that person's information wasn't updated for a month he would still fall in that category (since no delta query was run). Please advise on what is the best way to accomplish this. Thanks in advance, Moazzam
Re: delete Problem..
Are you running a commit command after every delete command? I had the same problem with updates. I wasn't committing my updates. - Moazzam Khan http://moazzam-khan.com On Tue, Aug 10, 2010 at 8:52 AM, kenf_nc ken.fos...@realestate.com wrote: I'd try 2 things. First do a query q=EMAIL_HEADER_FROM:test.de and make sure some documents are found. If nothing is found, there is nothing to delete. Second, how are you testing to see if the document is deleted? The physical data isn't removed from the index until you Optimize I believe. Is it possible your delete is working, but your method of verifying isn't telling you it's marked for deletion? -- View this message in context: http://lucene.472066.n3.nabble.com/delete-Problem-tp1070347p1072581.html Sent from the Solr - User mailing list archive at Nabble.com.
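The commit step matters because deletes (and adds) stay invisible to searchers until a commit. A minimal sketch of the two XML update bodies involved; the endpoint URL below is illustrative (the default example server), not taken from this thread, and in practice you would POST these bodies with any HTTP client:

```python
# Sketch of the delete-then-commit sequence described above.
# UPDATE_URL is an assumption (default Solr example endpoint).
UPDATE_URL = "http://localhost:8983/solr/update"

def delete_by_query(query: str) -> str:
    """Build the XML body that marks matching docs as deleted."""
    return f"<delete><query>{query}</query></delete>"

def commit() -> str:
    """Build the XML body that makes pending deletes/adds visible."""
    return "<commit/>"

# The two payloads to POST, in order, to UPDATE_URL:
payloads = [delete_by_query("EMAIL_HEADER_FROM:test.de"), commit()]
print(payloads[0])
print(payloads[1])
```

Sending the delete without the commit is exactly the symptom in the reply above: the delete is accepted but nothing appears to change.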
Re: Need help with facets
I have a solr index whose documents have the following fields: FirstName LastName RecruitedDate I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? It is possible with facet.query; something like: q=*:*&facet=on&facet.query=RecruitedDate:[NOW-1MONTH TO NOW]&facet.query=RecruitedDate:[NOW-3MONTHS TO NOW]
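For readers assembling this request programmatically, the parameters can be URL-encoded, and repeating facet.query yields one count per range. This sketch only builds the query string from the reply above; nothing here talks to a server:

```python
from urllib.parse import urlencode

# A list of tuples (not a dict) so facet.query can repeat.
params = [
    ("q", "*:*"),
    ("facet", "on"),
    ("facet.query", "RecruitedDate:[NOW-1MONTH TO NOW]"),
    ("facet.query", "RecruitedDate:[NOW-3MONTHS TO NOW]"),
]
query_string = urlencode(params)
print(query_string)
```

Each facet.query comes back in the response under facet_counts/facet_queries with its own document count.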
How to compile nightly build?
I am attempting to follow the instructions located at: http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example I have downloaded the most recent clean build from Hudson. After running 'ant example' I get the following error:

C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29>ant example
Buildfile: C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\build.xml
init-forrest-entities:
compile-lucene:
BUILD FAILED
C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\common-build.xml:214: C:\solr_build\modules\analysis\common does not exist.
Total time: 0 seconds

What is the correct procedure? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1077115.html Sent from the Solr - User mailing list archive at Nabble.com.
Do we need index analyzer for query elevation component
Hello, In order to use query elevation we define a type. Do we really need an index-time analyzer for the query elevation type? Let's say we have some documents already indexed and I added only the query-time analyzer; it looks like solr reads the words in elevate.xml and maps the words to the respective documents. In that case why would we need index-time analyzers, unless I am missing something. Please let me know.

<fieldType name="elevateKeywordsType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

darniz -- View this message in context: http://lucene.472066.n3.nabble.com/Do-we-need-index-analyzer-for-query-elevation-component-tp1077130p1077130.html Sent from the Solr - User mailing list archive at Nabble.com.
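As a rough illustration of the behavior described (elevate.xml maps query words to document ids and forces those documents to the top, with the query-time analyzer normalizing the incoming query text), here is a sketch. The elevation map and result ids are made up, and real elevation is done by Solr's QueryElevationComponent, not client code; the lower-casing step plays the role of the query-time analyzer, which is why matching elevate.xml entries doesn't depend on how documents were analyzed at index time:

```python
# Hypothetical elevate.xml contents: query text -> doc ids to force to the top.
elevations = {"ipod": ["doc3", "doc1"]}

def elevate(query: str, results: list[str]) -> list[str]:
    """Move elevated ids (in their configured order) ahead of organic results."""
    boosted = elevations.get(query.lower(), [])  # lower-casing ~ query analyzer
    rest = [r for r in results if r not in boosted]
    return [b for b in boosted if b in results] + rest

print(elevate("iPod", ["doc1", "doc2", "doc3", "doc4"]))
```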
RE: PDF file
Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance. -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf: curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@mccm.pdf Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help!
Re: Improve Query Time For Large Index
Hi Tom, my index is around 3GB large and I am using 2GB RAM for the JVM, although some more is available. If I look into the RAM usage while a slow query runs (via jvisualvm) I see that only 750MB of the JVM RAM is used.

Can you give us some examples of the slow queries?

For example the empty query solr/select?q= takes very long, or solr/select?q=http where 'http' is the most common term.

Are you using stop words?

Yes, a lot. I stored them in stopwords.txt.

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

This looks interesting. I read through https://issues.apache.org/jira/browse/SOLR-908 and it seems to be in 1.4. I only need to enable it via:

<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>

right? Do I need to reindex? Regards, Peter.

Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of postings lists, so you need to leave some memory for the OS. On the other hand, if you are optimizing and refreshing every 10-15 minutes, that will invalidate all the caches, since an optimized index is essentially a set of new files. Can you give us some examples of the slow queries? Are you using stop words? If your slow queries are phrase queries, then you might try either adding the most frequent terms in your index to the stopwords list, or try CommonGrams and add them to the common words list.
(Details on CommonGrams here: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2) Tom Burton-West

-Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index

Hi, I have 5 million small documents/tweets (= ~3GB) and the slave index replicates itself from the master every 10-15 minutes, so the index is optimized before querying. We are using solr 1.4.1 (patched with SOLR-1624) via SolrJ. Now the search speed is slow: 2s for common terms which hit more than 2 million docs, and acceptable for others: 0.5s. For those numbers I don't use highlighting or facets. I am using the following schema [1], and from the luke handler I know that numTerms is ~20 million. The query for common terms stays slow if I retry again and again (no cache improvements). How can I improve the query time for the common terms without using Distributed Search [2]? Regards, Peter.

[1]
<field name="id" type="tlong" indexed="true" stored="true" required="true"/>
<field name="date" type="tdate" indexed="true" stored="true"/>
<!-- term* attributes to prepare faster highlighting. -->
<field name="txt" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>

[2] http://wiki.apache.org/solr/DistributedSearch -- http://karussell.wordpress.com/
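As a rough sketch of what CommonGrams does: each common word is glued to its neighbor into a single bigram token, so a phrase query over frequent terms can match one comparatively rare bigram instead of intersecting two huge postings lists. The common-words set below is a made-up stand-in for stopwords.txt entries, and the real filter also interleaves the bigrams at the correct token positions, which this sketch omits:

```python
# Stand-in for stopwords.txt entries; not a real common-words list.
COMMON = {"the", "of", "http"}

def common_grams(tokens):
    """Emit the original tokens plus a word_word bigram whenever either side is common."""
    out = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if a in COMMON or b in COMMON:
            out.append(f"{a}_{b}")
    return out

print(common_grams(["rights", "of", "man"]))
```

Because the bigram tokens only exist if they were written at index time, switching to CommonGrams does require reindexing, which answers Peter's last question.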
RE: PDF file
Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon From: Ma, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov] Sent: Tuesday, August 10, 2010 11:57 AM To: 'solr-user@lucene.apache.org' Subject: RE: PDF file Does anyone have any experience with PDF files? I really appreciate your help! Thanks so much in advance. -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file I have a lot of pdf files. I am trying to import pdf files to solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import mccm.pdf: curl 'http://localhost:8983/solr/solr-home/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@mccm.pdf Please tell me if there is a way to import pdf files from a directory. Thanks so much for your help! - SECURITY/CONFIDENTIALITY WARNING: This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited.
If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. -
RE: PDF file
Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is:

curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf

Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file?

[xiao...@lhcinternal lhc]$ curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 404</title>
</head>
<body><h2>HTTP ERROR: 404</h2><pre>NOT_FOUND</pre>
<p>RequestURI=/solr/lhc/update/extract</p><p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>

-Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon
Re: Need help with facets
Thanks Ahmet, that worked! Here's another issue I have. Like I said before, I have these fields in Solr documents: FirstName, LastName, RecruitedDate, VolumeDate (just added this in this email), VolumeDone (just added this in this email). Now I have to get the sum of all VolumeDone (integer field) for this month by everyone, then take 25% of that number and get all people whose volume was more than that. Is there a way to do this? :D I did some research but I wasn't able to come up with an answer. Thanks, Moazzam On Tue, Aug 10, 2010 at 1:42 PM, Ahmet Arslan iori...@yahoo.com wrote: I have a solr index whose documents have the following fields: FirstName LastName RecruitedDate I update the index when any of the three fields change for that specific person. I need to get facets based on when someone was recruited. The facets are: Recruited within 1 month, Recruited within 3 months, ... So if 10 people were recruited within the past month then the count for "Recruited within 1 month" will be 10. Is there a way to calculate the facets from RecruitedDate? It is possible with facet.query; something like: q=*:*&facet=on&facet.query=RecruitedDate:[NOW-1MONTH TO NOW]&facet.query=RecruitedDate:[NOW-3MONTHS TO NOW]
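Solr 1.4 has no aggregate query that sums a field and then filters by a percentage of that sum, so one pragmatic option is to fetch the month's documents (e.g. with a range query on VolumeDate) and compute the threshold client-side. A sketch with made-up data; in practice the dicts would come from the Solr response:

```python
# Made-up recruits; in practice these come from a query like
# VolumeDate:[NOW/MONTH TO NOW] returning name and VolumeDone fields.
people = [
    {"name": "A", "VolumeDone": 10},
    {"name": "B", "VolumeDone": 30},
    {"name": "C", "VolumeDone": 60},
]

total = sum(p["VolumeDone"] for p in people)      # sum for the month
threshold = 0.25 * total                          # 25% of the total
above = [p["name"] for p in people if p["VolumeDone"] > threshold]
print(total, threshold, above)
```

The trade-off is having to pull all of the month's documents; for large result sets you would page through them rather than fetch everything at once.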
Re: How to compile nightly build?
You don't have to download the source. You can just download the binary distribution from their site and run it without compiling it. - Moazzam On Tue, Aug 10, 2010 at 1:48 PM, harrysmith harrysmith...@gmail.com wrote: I am attempting to follow the instructions located at: http://wiki.apache.org/solr/ExtractingRequestHandler#Getting_Started_with_the_Solr_Example I have downloaded the most recent clean build from Hudson. After running 'ant example' I get the following error:

C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29>ant example
Buildfile: C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\build.xml
init-forrest-entities:
compile-lucene:
BUILD FAILED
C:\solr_build\apache-solr-4.0-2010-07-27_08-06-29\common-build.xml:214: C:\solr_build\modules\analysis\common does not exist.
Total time: 0 seconds

What is the correct procedure? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1077115.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Modifications to AbstractSubTypeFieldType
Compound types are young and will probably mutate. I will do my own hack until things settle down. Lance On Mon, Jul 12, 2010 at 12:47 AM, Mark Allan mark.al...@ed.ac.uk wrote: On 7 Jul 2010, at 6:24 pm, Yonik Seeley wrote: On Wed, Jul 7, 2010 at 8:15 AM, Grant Ingersoll gsing...@apache.org wrote: Originally, I had intended that it was just for one Field Sub Type, thinking that if we ever wanted multiple sub types, that a new, separate class would be needed Right - this was my original thinking too. AbstractSubTypeFieldType is only a convenience class to create compound types... people can do it other ways. Just for clarification, does that mean my modifications won't be included? If so, can you let me know so that I can extract the changes and maintain them in a different package structure from the main Solr code please. Cheers Mark -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Lance Norskog goks...@gmail.com
Re: How to compile nightly build?
In this particular case I would like to get the trunk. Is there a different link for binary distributions of nightly builds? I had been downloading from here: http://hudson.zones.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/ In the case I did want to compile from the source, am I missing a step? -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-compile-nightly-build-tp1077115p1080266.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 1.4 - stats page slow
Apologies if this was resolved, but we just deployed Solr 1.4.1 and the stats page takes over a minute to load for us as well and began causing OutOfMemory errors so we've had to refrain from hitting the page. From what I gather, it is the fieldCache part that's causing it. Was there ever an official fix or recommendation on how to disable the stats page from calculating the fieldCache entries? If we could just ignore it, I think we'd be good to go since I find this page very useful otherwise. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-1-4-stats-page-slow-tp498810p1081193.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PDF file
Try:

curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file=Full_Path_of_File/pub2009001.pdf&literal.id=777045&commit=true'

stream.file - specify the full path
literal.* params - specify any extra params if needed

Regards, Jayendra

On Tue, Aug 10, 2010 at 4:49 PM, Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov wrote: Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F fi...@pub2009001.pdf Did I do something wrong? Do I need to modify anything in schema.xml or another configuration file? The server returned Jetty's "HTTP ERROR: 404 NOT_FOUND" page for RequestURI=/solr/lhc/update/extract. -Original Message- From: Sharp, Jonathan [mailto:jsh...@coh.org] Sent: Tuesday, August 10, 2010 4:37 PM To: solr-user@lucene.apache.org Subject: RE: PDF file Xiaohui, You need to add the following jars to the lib subdirectory of the solr config directory on your server. (path inside the solr 1.4.1 download) /dist/apache-solr-cell-1.4.1.jar plus all the jars in /contrib/extraction/lib HTH -Jon
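To index a whole directory of PDFs (the earlier question in this thread), one option is a small script that issues one extract request per file. This sketch only builds the request URLs and does not send them; the endpoint is the one from the thread, and using the file name stem as literal.id is an assumption for illustration:

```python
from pathlib import Path
from urllib.parse import urlencode

# Endpoint taken from the thread; adjust for your own server/core.
BASE = "http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract"

def extract_urls(directory: str) -> list[str]:
    """Build one extract request URL per PDF, using the file name stem as the id."""
    urls = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        qs = urlencode({
            "stream.file": str(pdf.resolve()),  # stream.file wants a full path
            "literal.id": pdf.stem,             # assumed id scheme, not from the thread
            "commit": "true",
        })
        urls.append(f"{BASE}?{qs}")
    return urls
```

Committing once after the loop (rather than commit=true on every file) would be cheaper for large directories.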
Re: DIH and multivariable fields problems
Glad I could help. I also would have thought it was a very common issue. Personally my schema is almost all dynamic fields. I have unique_id, content, last_update_date and maybe one other field specifically defined; the rest are all dynamic. This lets me accept an almost endless variety of document types into the same schema. So if I planned on using DIH I had to come up with a way, and stitching together solutions to a couple of related issues got me to my script transform. Mine is more convoluted than the one I gave here, but obviously you got the gist of the idea. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-and-multivariable-fields-problems-tp1032893p1081738.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr query result not read the latest xml file
Thanks for your response, Jan. I just learned that post.jar is only an example tool, so what should I use instead of post.jar for production? Btw, I already tried using this command:

java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml

and IT WORKS!! The cat_817.xml is reflected directly in the solr query after I commit it; this is the url:

http://localhost:8983/search/select/?q=ITEM_CAT:817&version=2.2&start=0&rows=10&indent=on

The problem is that it only works if the old xml contains fewer docs than the new xml. For example, if the old cat_817.xml contains 2 docs and the new cat_817.xml contains 10 docs, then I just have to re-index (java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml) and the query result will be correct (10 docs), but it doesn't work vice versa. If the old cat_817.xml contains 10 docs and the new cat_817.xml contains 2 docs, then I have to delete the index first (java -Ddata=args -Dcommit=yes -jar post.jar "<delete><query>ITEM_CAT:817</query></delete>") and re-index it (java -Durl=http://localhost:8983/search/update -jar post.jar cat_817.xml) to make the query result updated (2 docs). Is this a normal process or is something wrong with my solr? Once again thanks, Jan; your help really makes my day brighter :) and I believe your answer will help many solr newbies, especially me -- View this message in context: http://lucene.472066.n3.nabble.com/solr-query-result-not-read-the-latest-xml-file-tp1066785p1081802.html Sent from the Solr - User mailing list archive at Nabble.com.
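The asymmetry described above is expected behavior rather than a bug: posting a document whose uniqueKey already exists overwrites it, but documents present only in the old file are never touched, so when the new file has fewer docs the leftovers linger until explicitly deleted. A dict-based sketch of that behavior (the doc ids and category are made up to mirror the cat_817 example):

```python
# Simulate a Solr index as a dict keyed by uniqueKey.
index = {}

def post(docs):
    """Re-posting overwrites by uniqueKey but never removes absent docs."""
    for doc in docs:
        index[doc["id"]] = doc

def delete_by_cat(cat):
    """Analogue of <delete><query>ITEM_CAT:...</query></delete>."""
    for k in [k for k, d in index.items() if d["cat"] == cat]:
        del index[k]

post([{"id": i, "cat": 817} for i in range(10)])  # old file: 10 docs
post([{"id": i, "cat": 817} for i in range(2)])   # new file: 2 docs
print(len(index))  # still 10: the 8 docs missing from the new file remain

delete_by_cat(817)
post([{"id": i, "cat": 817} for i in range(2)])
print(len(index))  # now 2, matching the new file
```

So the delete-then-reindex step for the shrinking-file case is the normal process, not a misconfiguration.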