Replication not triggered

2015-04-27 Thread Michael Lackhoff
We have old-fashioned replication configured between one master and one
slave. Everything used to work, but today I noticed that recent records
were not present on the slave (the same query gives hits on the master
but none on the slave).
The replication communication seems to work. This is what I get in the logs:

INFO: [default] webapp=/solr path=/replication
params={command=fetchindex&_=1430136325501&wt=json} status=0 QTime=0
Apr 27, 2015 2:05:25 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.
Apr 27, 2015 2:05:25 PM org.apache.solr.core.SolrCore execute
INFO: [default] webapp=/solr path=/replication
params={command=details&_=1430136325600&wt=json} status=0 QTime=21

It says both are in sync, but obviously they are not, and even the
replication page of the admin view shows different Version, Gen and Size:
                    Version        Gen  Size
Master (Searching)  1430107573634  27   287.19 GB
Master (Replicable) 1430107573634  27   -
Slave (Searching)   1429762011916  23   287.14 GB

Any idea why the replication is not triggered here or what I could try
to fix it?
Solr Version is 4.10.3.
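For completeness: I can trigger the same handler commands by hand (the
ones that show up in the log above; host and core name are of course
ours):

curl "http://slave:8983/solr/default/replication?command=fetchindex"
curl "http://slave:8983/solr/default/replication?command=details&wt=json"

but fetchindex just reports the slave as being in sync again.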

-Michael


Re: variaton on boosting recent documents gives exception

2015-02-13 Thread Michael Lackhoff
On 13.02.2015 at 11:18, Gonzalo Rodriguez wrote:

 You can always change the type of your sortyear field to an int, or create an 
 int version of it and use copyField to populate it.

But that would require me to reindex. Would be nice to have some type
conversion available within a function query.

 And using NOW/YEAR will round the current date to the start of the year, you 
 can read more about this in the Javadoc: 
 http://lucene.apache.org/solr/4_10_3/solr-core/org/apache/solr/util/DateMathParser.html
 
 You can test it using the example collection: 
 http://localhost:8983/solr/collection1/select?q=*:*&boost=recip(ms(NOW/YEAR,manufacturedate_dt),3.16e-11,1,1)&fl=id,manufacturedate_dt,score,[explain]&defType=edismax
 and checking the explain field for the numeric value given to NOW/YEAR vs 
 NOW/HOUR, etc.

The definition of the *_dt fields in the example schema is 'date', but my
field is text, or (t)int if I have to reindex.

To compare against this int field I need another (comparable) int.
ms(NOW/YEAR,manufacturedate_dt) is an int, but a huge one, which is very
difficult to bring into a sensible relationship with e.g. '2015'.
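(For scale: recip(x,m,a,b) computes a/(m*x + b), so with the m=3.16e-11
from the example a document dated one year back -- x around 3.15e10 ms --
scores about 1/(3.16e-11 * 3.15e10 + 1), i.e. roughly 0.5. The huge
millisecond value is tamed by the tiny m, but none of that helps when the
field holds a plain year as text.)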

Your suggestion would only work if I changed my year to a date like
2015-01-01T00:00:00Z, which is not a sensible format for a publication
year and not even easily creatable by copyField.

What I need is a real year number, not a date truncated to the year,
which is only accessible as the number of milliseconds since the epoch
(Jan 1st, 1970, 00:00:00), which is not very handy.

-Michael


variaton on boosting recent documents gives exception

2015-02-12 Thread Michael Lackhoff
Since my field to measure recency is not a date field but a string field
(with only year-numbers in it), I tried a variation on the suggested
boost function for recent documents:
  recip(sub(2015,min(sortyear,2015)),1,10,10)
But this gives an exception when used in a boost or bf parameter.
I guess the reason is that all the mathematics doesn't work with a
string field even if it only contains numbers. Am I right with this
guess? And if so, is there a function I can use to change the type to
something numeric? Or are there other problems with my function?
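(For reference, the intended effect if the maths worked: recip(x,1,10,10)
is 10/(x+10), so sortyear=2015 gives 10/(0+10) = 1.0 and sortyear=2005
gives 10/(10+10) = 0.5 -- a gentle decay for older titles.)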

Another related question: as you can see the current year (2015) is hard
coded. Is there an easy way to get the current year within the function?
Messing around with NOW looks very complicated.

-Michael


pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
My aim is to boost exactish matches similar to the recipe described in
[1]. The anchoring works in q but not in pf, where I need it. Here is an
example that shows the effect:
q=title_exact:"anatomie"&pf=title_exact^2000
debugQuery says it is interpreted this way:
+title_exact:" anatomie " (title_exact:" "^2000.0)

As you can see, the contents of q is missing in the boosted part.
Of course I also tried more realistic variants like
q=title:anatomie&pf=title_exact^10
(regular field and no quotes in q, exact field in pf)
gives: +title:anatomie (title_exact:" "^10.0)

The fieldType definition is not exactly as in [1] but very similar and
working in q (see first example above).

Here are the relevant parts of my schema.xml:
<field name="title_exact" type="text_lr" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="title_exact" />
<fieldType name="text_lr" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(.*)$" replacement=" $1 " />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>

Any idea what is going wrong here? And even more important how I can fix it?

--Michael

[1]
http://robotlibrarian.billdueber.com/2012/03/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/


Re: pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
On 11.01.2015 at 14:01, Ahmet Arslan wrote:

 What happens when you do not use fielded query?
 
 q=anatomie&qf=title_exact
 instead of
 
 q=title_exact:anatomie

Then it works (with qf=title):
+(title:anatomie) (title_exact:" anatomie "^20.0)

Only problem is that my frontend always does a fielded query.

Is there a way to make it work for fielded query?
Or put another way: How can I do this boost in more complex queries like:
title:foo AND author:miller AND year:[2010 TO *]
It would be nice to have a title "foo" before another title "some foo
and bar" (given the other criteria also match both titles).
In such cases it is almost impossible to move the search fields to the
qf parameter.

--Michael


Re: pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
On 11.01.2015 at 14:19, Michael Lackhoff wrote:

 Or put another way: How can I do this boost in more complex queries like:
 title:foo AND author:miller AND year:[2010 TO *]
 It would be nice to have a title "foo" before another title "some foo
 and bar" (given the other criteria also match both titles).
 In such cases it is almost impossible to move the search fields to the
 qf parameter.

How about this one: It should be possible to construct a query with a
combination of more than one query parser. Is it possible to get this
pseudo-code-variant of the above example into a working search-URL?:
(defType=edismax
  q=anatomie
  qf=title^10 related_title^5
  pf=title_exact^20
)
AND
(defType=edismax
  q=miller
  qf=author^10 editor^5
)
AND
(defType=edismax or perhaps another defType
 q=[2010 TO *]
 qf=year
)

My knowledge of the syntax is just not good enough to build such a beast
and test it. What would a select-request look like to do such a query?
Or would it be far too slow because of the complexity?
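My own best guess at a concrete URL -- untested, and assuming the magic
_query_ field really accepts edismax local params the way I read the
docs -- would be something like (before URL-encoding):

q=_query_:"{!edismax qf='title^10 related_title^5' pf='title_exact^20'
v='anatomie'}" AND _query_:"{!edismax qf='author^10 editor^5'
v='miller'}" AND year:[2010 TO *]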

--Michael


Re: pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
Hi Ahmet,

 You might find this useful : 
 https://lucidworks.com/blog/whats-a-dismax/

I have a basic understanding but will do further reading...

 Regarding your example : title:foo AND author:miller AND year:[2010 TO *]
 the last two clauses are better served as a filter query.
 
 http://wiki.apache.org/solr/CommonQueryParameters#fq

You are right for a hand-crafted query, but I have to deal with arbitrarily
complex user queries which are syntax-checked within the front-end
application but not much more. I find it difficult to automatically
detect which part of the query can be moved to a filter query.

 By the way it is possible to combine different query parsers in a single 
 query, but I believe your use-case does not need that.
 https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries

Perhaps not, but how can I tackle my original problem then? Is there a
way to boost exact titles (or whatever is in pf for that matter) within
fielded queries, since that is what I have to deal with? The example
above was just that -- an example -- people can come up with all sorts
of complex/fielded queries but most of them contain a title (or part of
it) and I want to boost those that have an exact(ish) match.

--Michael


Re: pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
On 11.01.2015 at 18:30, Jack Krupansky wrote:

 It's still not quite clear to me what your specific goal is. From your
 vague description it seems somewhat different from the blog post that you
 originally cited. So, let's try one more time... explain in plain English
 what use case you are trying to satisfy.

I think it is the use case from the blog entry. I got the complaint that
users didn't find (at least not on the first result page) titles they
entered exactly -- and I wanted to fix this by boosting exact matches.
The example given to me was the title "Anatomie". So I tried it:
title:anatomie and got lots of hits, all of which contained the word in
the title, but among the first 10 hits there was none with the (exact)
title "Anatomie" the user was looking for.
As next step I did a web search, found the blog entry, implemented it,
was happy with the simple case but couldn't make it work with fielded
queries (which we have to support, see below).

At the moment we even have only fielded queries, since the application
makes the default search field explicit -- which I could change but
would like to keep if possible. But even if I change this case, I still
have to cope with fielded queries that are not just targeting the
default search field.

 You mention fielded queries, but in my experience very few end-users would
 know about let alone use them. So, either you are giving your end-users
 specific guidance for writing queries - in which case you can give them
 more specific guidance that achieves your goals, or if these fielded
 queries are in fact generated by the client or app layer code, then maybe
 you just need to put more intelligence into that query-generation code in
 the client.

It is the old library search problem: most users don't use it but we
also have various kinds of experts among our users (few but important)
who really use all the bells and whistles.

And I have to somehow satisfy both groups: those who only do a
one-word-search within the default search field and those with complex
fielded queries -- and both should find titles they enter exactly at the
top, even if combined with dozens of other criteria.

And it doesn't really help to question the demand since the demand is
there and somewhat external. The point is how to best meet it.

--Michael



Re: pf doesn't work like normal phrase query

2015-01-11 Thread Michael Lackhoff
Thanks everyone for all the advice!

To sum up there seems to be no easy solution. I only have the option to
either
- make things really complicated
- only help some users/query structures
- accept the status quo

What could help is an analogue of field aliases:
If it were possible to say
f.title.pf=title_exact^10 title_proper^5
analogous to (the existing)
f.title.qf=title_proper^10 title_related
everything would work just fine.

But I guess this will only come if or when one of the developers has an
itch to scratch ;-)

Anyway, thanks a lot for all help and a great product
--Michael


Re: Solution for reverse order of year facets?

2014-03-04 Thread Michael Lackhoff

Hi Ahmet,


I forgot to include what I did for one customer :

1) Using StatsComponent I get min and max values of the field (year)
2) Calculate smart gap/range values according to minimum and maximum.
3) Re-issue the same query (for the second time) that includes a set of 
facet.query.
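(A minimal sketch of that recipe, for my own notes -- parameter names
from the StatsComponent wiki page, ranges invented:
1) .../select?q=...&rows=0&stats=true&stats.field=year
2) read min/max for year from the stats section of the response and
   compute the gaps
3) .../select?q=...&facet=true&facet.query=year:[2010 TO 2014]
   &facet.query=year:[2005 TO 2009]&... )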


It's amazing: everyone I am talking with about this problem seems to 
remember some hack(s) to work around it ;-)


On one hand it shows there are some options (and thanks for giving me 
some more!), but on the other hand it also shows how much need there is 
for a real solution like SOLR-1672. I really hope Shawn finds some time 
to make it work.


-Michael


Solution for reverse order of year facets?

2014-03-03 Thread Michael Lackhoff
If I understand the docs right, it is only possible to sort facets by
count or value in ascending order. Both variants are not very helpful
for year facets if I want the most recent years at the top (or appear at
all if I restrict the number of facet entries).

It looks like a requirement that was articulated repeatedly, and the
recommended solution seems to be to do some math like "10000 - year" and
index that. So far so good. The only problem is that I have many data
sources and I would like to avoid to change every connector to include
the new field. I think a better solution would be to have a custom
TokenFilterFactory that does it.

Since it seems a common request, did someone already build such a
TokenFilterFactory? If not, do you think I could build one myself? I do
some (script-)programming but have no experience with Java, so I think I
could adapt an example. Are there any guides out there?

Or even better, is there a built-in solution I haven't heard of?

-Michael


Re: Solution for reverse order of year facets?

2014-03-03 Thread Michael Lackhoff
On 03.03.2014 16:33 Ahmet Arslan wrote:

 Currently there are two storing criteria available. However sort by index - 
 to return the constraints sorted in their index order (lexicographic by 
 indexed term) - should return most recent year at top, no?

No, it returns them -- as you say -- in lexicographic order and that
means oldest first, like:
1815
1820
...
2012
2013
(might well stop before we get here)
2014

-Michael


Re: Solution for reverse order of year facets?

2014-03-03 Thread Michael Lackhoff
Hi Ahmet,

 There is no built in solution for this.

Yes, I know, that's why I would like the TokenFilterFactory

 Two workaround :
 
 1) use facet.limit=-1 and invert the list (faceting response) at client side
 
 2) use multiples facet.query
a) facet.query=year:[2012 TO 2014]&facet.query=year:[2010 TO 2012] 
b) facet.query=year:2014&facet.query=year:2013 ...

I thought about these but they have their disadvantages: 1) could
return hundreds of facet entries. 2b) is better but would need about 30
facet queries, which makes quite a long URL, and it wouldn't always work
as expected. There are subjects that were very popular in the past but
with no (or very few) recent publications. For these I would get empty
results for my 2014-1985 facet queries but miss all the stuff from the
1960s.

From all these thoughts I came to the conclusion that a custom
TokenFilterFactory could do exactly what I want. In effect it would give
me a reverse sort:
10000 - 2014 = 7986
10000 - 2013 = 7987
...
The client code can easily regain the original year values for display.

And I think it shouldn't be too difficult to write such a beast, only
problem is I am not a Java programmer. That is why I asked if someone
has done it already or if there is a guide I could use.
After all it is just a simple subtraction...
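To make the idea concrete, here is my rough sketch of what I imagine --
cobbled together from Lucene 4.x examples, class name invented and
completely untested; a small matching TokenFilterFactory would also be
needed to use it from schema.xml:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rewrites every four-digit year token into (10000 - year) so that the
// normal ascending order of the indexed terms corresponds to descending
// years.
public final class ReverseYearFilter extends TokenFilter {
  private final CharTermAttribute termAtt =
      addAttribute(CharTermAttribute.class);

  public ReverseYearFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    if (term.matches("\\d{4}")) {
      // e.g. 2014 -> 7986, 2013 -> 7987, ...
      int reversed = 10000 - Integer.parseInt(term);
      termAtt.setEmpty().append(Integer.toString(reversed));
    }
    return true;
  }
}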

-Michael



Re: Solution for reverse order of year facets?

2014-03-03 Thread Michael Lackhoff
On 03.03.2014 19:58 Shawn Heisey wrote:

 There's already an issue in Jira.
 
 https://issues.apache.org/jira/browse/SOLR-1672

Thanks, this is of course the best solution. The only problem is that I
use a custom version from a vendor (based on version 4.3) that I want to
enhance. But perhaps they will apply the patch. In the meantime I still
think the custom filter could be a workaround.

 I can't take a look now, but I will later if someone else hasn't taken 
 it up.

That would be great!

Thanks
-Michael



Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 18:03 Erick Erickson wrote:

 The problem I've always had is that I don't quite know what
 sorting on multivalued fields means. If your field had tokens
 "a" and "z", would sorting on that field put the doc
 at the beginning or end of the list? Sure, you can define
 rules (first token, last token, average of all tokens (whatever
 that means)), but each solution would be wrong sometime,
 somewhere, and/or completely useless.

Of course it would need rules but I think it wouldn't be too hard to
find rules that are at least far better than the current situation.

My wish would include an option that decides whether the field is used
just once or every value on its own. If the option is set to FALSE, only
the first value would be used; if it is TRUE, every value of the field
would get its own place in the result list.

so, if we have e.g.
record1: "ccc" and "bbb"
record2: "aaa" and "zzz"
it would be either
record2 ("aaa")
record1 ("ccc")
or
record2 ("aaa")
record1 ("bbb")
record1 ("ccc")
record2 ("zzz")

I find these two outcomes most plausible, so I would allow them if
technically possible; but whatever rule looks more plausible to the
experts: some solution is better than no solution.

-Michael


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 20:31 Martijn v Groningen wrote:

 The first solution would make sense to me. Some kind of a strategy
 mechanism
 for this would allow anyone to define their own rules. Duplicating results
 would be confusing to me.

That is why I would only activate it on request (setting a special
option). Example use case: A library catalogue with an author sort. All
books of an author would be together, no matter how many co-authors the
book has.
So I think it could be useful (as an option) but I have no idea how
difficult it would be to implement. As I said, it would be nice to have
at least something. Any possible customization would be an extra bonus.

-Michael


Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 21:28 Erick Erickson wrote:

 Fair enough, but what's "first value in the list"?
 There's nothing special about multiValued fields,
 that is where the schema has multiValued="true".
 Under the covers, this is no different than just
 concatenating all the values together and putting them
 in at one go, except for some games with the
 position between one term and another
 (positionIncrementGap). Part of my confusion is
 that the term "multi-valued" is sometimes used to
 refer to multiValued="true" and sometimes used
 to refer to documents with more than one
 *token* in a particular field (often as the result
 of the analysis chain).

I guess, since multivalued fields are not really different under the
hood, they should be treated the same. So, no matter whether the different
values are the result of multiValued="true" or of the analysis chain:
if the whole thing starts with an "a" put it first, if it starts with a
"z" put it last.
Example (multivalued field):
Smith, Adam
Duck, Dagobert
=> sort as "s" (or "S")
Example (tokenized field):
This is a tokenized field
=> sort as "t" (or "T")

 The second case seems to be more in the
 grouping/field collapsing arena, although
 that doesn't work on fields with more than one
 value yet either. But that seems a more sensible
 place to put the second case rather than
 overloading sorting.

It depends on how you see the meaning of sorting:
1. Sort the records based on one single value per record (and return
them in this order).
2. Sort the values of the field to sort on (and return the records
belonging to the respective values).

As long as sorting is only allowed on single value fields, both are
identical. As soon as you allow multivalued fields to be sorted on, both
interpretations mean something different and I think both have their
valid use case.
But I don't want to stress this too far.

-Michael



Re: problem in setting field attribute in schema.xml

2011-05-26 Thread Michael Lackhoff

On 26.05.2011 12:52, Romi wrote:

i have done it, i deleted the old indexes and created new indexes but am
still able to search it through *:*, and get no result when i search it
as field:value. really surprising result. :-O


I really don't understand your problem. This is not at all surprising 
but the expected behaviour:
*:* just gives you every document in your index, no matter what of the 
document is stored or indexed; it just gives _everything_, whereas
field:value does an actual search for an indexed value "value" in field 
"field". So no surprise either that you didn't get a result here if you 
didn't index the field.


-Michael


Re: problem in setting field attribute in schema.xml

2011-05-26 Thread Michael Lackhoff

On 26.05.2011 14:10, Romi wrote:

did u mean when i set indexed=false and stored=true, solr does not index
the field's value but stores its value as it is???


I don't know if you are asking me, since you do not quote anything, but 
yes, of course, this is exactly the purpose of indexed and stored.


-Michael


Re: problem in setting field attribute in schema.xml

2011-05-25 Thread Michael Lackhoff

On 25.05.2011 15:47, Vignesh Raj wrote:

It's very strange. Even I tried the same now and am getting the same result.
I have set both indexed=false and stored=false.
But still if I search for a keyword using my default search, I get the
results in these fields as well.
But if I specify field:value, it shows 0 results.

Can anyone explain?


I guess you copy the field to your default search field.

-Michael


Re: Is semicolon a character that needs escaping?

2010-09-08 Thread Michael Lackhoff
On 08.09.2010 00:05 Chris Hostetter wrote:

 
 : Subject: Is semicolon a character that needs escaping?
   ...
 : From this I conclude that there is a bug either in the docs or in the
 : query parser or I missed something. What is wrong here?
 
 Back in Solr 1.1, the standard query parser treated ";" as a special 
 character and looked for sort instructions after it.  
 
 Starting in Solr 1.2 (released in 2007) a sort param was added, and the 
 semicolon was only considered a special character if you did not 
 explicitly mention a sort param (for back compatibility).
 
 Starting with Solr 1.4, the default was changed so that the semicolon wasn't 
 considered a meta-character even if you didn't have a sort param -- you 
 have to explicitly select the lucenePlusSort QParser to get this 
 behavior.
 
 I can only assume that if you are seeing this behavior, you are either 
 using a very old version of Solr, or you have explicitly selected the 
 lucenePlusSort parser somewhere in your params/config.
 
 This was heavily documented in CHANGES.txt for Solr 1.4 (you can find 
 mention of it when searching for either ";" or "semicolon")

I am using 1.3 without a sort param which explains it, I think. It would
be nice to update to 1.4 but we try to avoid such actions on a
production server as long as everything runs fine (the semicolon thing
was only reported recently).

Many thanks for your detailed explanation!
-Michael


Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
only these characters need escaping:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
but with this simple query:
TI:stroke; AND TI:journal
I got the error message:
HTTP ERROR: 400
Unknown sort order: TI:journal

My first guess was that it was a URL encoding issue but everything looks
fine:
http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
as you can see, the semicolon is encoded as %3B
There is no problem when the query ends with the semicolon:
TI:stroke;
gives no error.
The first query also works if I escape the semicolon:
TI:stroke\; AND TI:journal

From this I conclude that there is a bug either in the docs or in the
query parser or I missed something. What is wrong here?

-Michael


Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
On 03.09.2010 00:57 Ken Krugler wrote:

 The docs need to be updated, I believe. From some code I wrote back in  
 2006...
 [...]

Thanks this explains it very well.

 But in general escaping characters in a query gets tricky - if you can  
 directly build queries versus pre-processing text sent to the query  
 parser, you'll save yourself some pain and suffering.

What do you mean by these two alternatives? That is, what exactly could
I do better?

 Also, since I did the above code the DisMaxRequestHandler has been  
 added to Solr, and it (IIRC) tries to be smart about handling this  
 type of escaping for you.

Dismax is not (yet) an option because we need the full lucene syntax
within the query. Perhaps this will change with the new enhanced dismax
request handler but I didn't play with it enough (will do with the next
release).

-Michael


Re: Is semicolon a character that needs escaping?

2010-09-02 Thread Michael Lackhoff
Hi Ken,

 But in general escaping characters in a query gets tricky - if you  
 can
 directly build queries versus pre-processing text sent to the query
 parser, you'll save yourself some pain and suffering.

 What do you mean by these two alternatives? That is, what exactly  
 could
 I do better?
 
 By "can build...", I meant if you can come up with a GUI whereby the  
 user doesn't have to use special characters (other than, say, quoting)  
 then you can take a collection of clauses and programmatically build  
 your query, without using the query parser.

I think I have that (escaping of characters that have a special meaning
in Solr). I just didn't know that the semicolon is one of them. So it
would be nice if the docs could be updated to account for this.

Thanks again
-Michael


Re: Very basic questions: Indexing text

2010-06-28 Thread Michael Lackhoff
On 28.06.2010 23:00 Ahmet Arslan wrote:

 1) I can get my docs in the index, but when I search, it
 returns the entire document.  I'd love to have it only
 return the line (or two) around the search term.
 
 Solr can generate Google-like snippets as you describe. 
 http://wiki.apache.org/solr/HighlightingParameters

I didn't know this was possible and am also interested in this feature,
but even after reading the given wiki page I cannot make out which
parameter to use. The only parameter that looks similar is
'hl.maxAlternateFieldLength', where it is possible to give a length to
return, but according to the description that is for the case of no match.
And there is hl.fragmentsBuilder, but with no explanation (the referred-to
page SolrFragmentsBuilder does not yet exist).

Could you give an example?
E.g. let's say I have a field 'title' and a field 'fulltext' and my
search term is 'solr'. What would be the right set of parameters to get
back the whole title field but only a snippet of 50 words (or three
sentences or whatever the unit) from the fulltext field?
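(My naive guess from the parameter list would be something like
...&hl=true&hl.fl=fulltext&hl.snippets=1&hl.fragsize=300&fl=title
i.e. highlight only the fulltext field and return title as a normal
stored field -- but I don't know whether hl.fragsize is the right knob,
it seems to be measured in characters rather than words.)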


Thanks
-Michael


Re: exceptionhandling error-reporting?

2010-04-06 Thread Michael Lackhoff
On 06.04.2010 17:49 Alexander Rothenberg wrote:

 On Monday 05 April 2010 20:14:44 Chris Hostetter wrote:
 define "crashes"? ... presumably you are talking about the client crashing
 because it can't parse the error response, correct? ... the best suggestion
 given the current state of Solr is to make the client smart enough to not
 attempt parsing of the response unless the response code is 200.
 
 Yes, it tries to parse the HTML output while expecting JSON syntax. Because it 
 is a perl-mod from CPAN, I don't really want to customize it...

You don't have to. Just wrap the call in an eval, at least that is what
I do.

-Michael


Re: Confused by Solr Ranking

2010-03-09 Thread Michael Lackhoff
On 09.03.2010 16:01 Ahmet Arslan wrote:

 
 I kind of suspected stemming to be the reason behind this.
 But I consider stemming to be a good feature.
 
 This is the side effect of stemming. Stemming increases recall while harming 
 precision.

But most people want the best possible combination of both, something like:
(raw_field:word OR stemmed_field:word^0.5)
and it is nice that Solr allows such arrangements, but it would be even
nicer to have some sort of automatic "take this field, transform the
contents in a couple of ways and do some boosting in the order given".
At least this would be my wish for the recent question about the one
feature I would like to see.
Or even better, allow not only a hierarchy of transformations but also a
hierarchy of fields (like in dismax, but with the full power of the
standard request handler)

-Michael



Re: schema-based Index-time field boosting

2009-11-23 Thread Michael Lackhoff
On 23.11.2009 19:33 Chris Hostetter wrote:

 ...if there was a way to boost fields at index time that was configured in 
 the schema.xml, then every doc would get that boost on its instances of 
 those fields, but the only purpose of index-time boosting is to indicate 
 that one document is more significant than another doc -- if every doc 
 gets the same boost, it becomes a no-op.
 
 (think about the math -- field boosts become multipliers in the fieldNorm 
 -- if every doc gets the same multiplier, then there is no net effect)

Coming in a bit late, but I would like a variant that is not a no-op.
Think of something like title:searchstring^10 OR catch_all:searchstring.
Of course I can always add the boosting at query time but it would make
life easier if I could define a default boost in the schema so that my
query could just be title:searchstring OR catch_all:searchstring
but still get the boost for the title field.

Thinking this further it would be even better if it was possible to
define one (or more) fallback field(s) with associated boost factor in
the schema. Then it would be enough to query for title:searchstring and
it would be automatically expanded to e.g.
title:searchstring^10 OR title_other_language:searchstring^5 OR
catchall:searchstring
or whatever you define in the schema.

-Michael




Re: How to import multiple RSS-feeds with DIH

2009-11-09 Thread Michael Lackhoff
On 09.11.2009 09:46 Noble Paul നോബിള്‍ नोब्ळ् wrote:

 When you say the second example does not work, what does it mean?
 Some exception? (if yes, please post the stacktrace)

Very mysterious. Now it works, but I am sure I got an exception before.
All I remember is something like java.io.IOException: FULL. In the
right frame of the DIH debugging screen I got an error message from
Firefox: the connection was reset while displaying the page.

But I don't think it is reproducible now; perhaps some unrelated problem
like low memory or such. Thanks anyway and sorry for the noise.

-Michael


Getting started with DIH

2009-11-08 Thread Michael Lackhoff
I would like to start using DIH to index some RSS-Feeds and mail folders

To get started I tried the RSS example from the wiki but as it is Solr
complains about the missing id field. After some experimenting I found
out two ways to fill the id:

- <copyField source="link" dest="id"/> in schema.xml
This works but isn't very flexible. Perhaps I have other types of
records with a real id or a multivalued link-field. Then this solution
would break.

- Changing the id field to type uuid
Again I would like to keep real ids where I have them and not a random UUID.

What didn't work but looks like the potentially best solution is to fill
the id in my data-config by using the link twice:
  <field column="link" xpath="/RDF/item/link" />
  <field column="id"   xpath="/RDF/item/link" />
This would be a definition just for this single data source but I don't
get any docs (also no error message). No trace of any inserts whatsoever.
Is it possible to fill the id that way?

Another question regarding MailEntityProcessor
I found this example:
<document>
   <entity processor="MailEntityProcessor"
           user="someb...@gmail.com"
           password="something"
           host="imap.gmail.com"
           protocol="imaps"
           folders="x,y,z"/>
</document>

But what is the dataSource (the enclosing tag to document)? That is, how
would a minimal but complete data-config.xml look to index mails
from an IMAP server?
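My guess at a minimal file -- untested, and I am not even sure a
dataSource element is needed at all for mail -- would be:

<dataConfig>
  <document>
    <entity processor="MailEntityProcessor"
            user="someb...@gmail.com" password="something"
            host="imap.gmail.com" protocol="imaps"
            folders="x,y,z"/>
  </document>
</dataConfig>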

And finally, is it possible to combine the definitions for several
RSS-Feeds and Mail-accounts into one data-config? Or do I need a
separate config file and request handler for each of them?

-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 08.11.2009 17:03 Lucas F. A. Teixeira wrote:

 You have an example on using mail dih in solr distro

<blush>Don't know where my eyes were. Thanks!</blush>

When I was at it I looked at the schema.xml for the rss example and it
uses "link" as uniqueKey, which is of course good if you only have rss
items but not so good if you also plan to add other data sources.
So I am still interested in a good solution for my id problem:

 What didn't work but looks like the potentially best solution is to fill
 the id in my data-config by using the link twice:
  <field column="link" xpath="/RDF/item/link" />
  <field column="id"   xpath="/RDF/item/link" />
 This would be a definition just for this single data source but I don't
 get any docs (also no error message). No trace of any inserts whatsoever.
 Is it possible to fill the id that way?

and this one:

 And finally, is it possible to combine the definitions for several
 RSS-Feeds and Mail-accounts into one data-config? Or do I need a
 separate config file and request handler for each of them?

Thanks
-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 08.11.2009 16:56 Michael Lackhoff wrote:

 What didn't work but looks like the potentially best solution is to fill
 the id in my data-config by using the link twice:
   <field column="link" xpath="/RDF/item/link" />
   <field column="id"   xpath="/RDF/item/link" />
 This would be a definition just for this single data source but I don't
 get any docs (also no error message). No trace of any inserts whatsoever.
 Is it possible to fill the id that way?

Found the answer in the list archive: use TemplateTransformer:
  <field column="link" xpath="/RDF/item/link" />
  <field column="id"   template="${slashdot.link}" />

Only minor and cosmetic problem: there are brackets around the id field
(like [http://somelink/]). For an id this doesn't really matter but I
would like to understand what is going on here. In the wiki I found only
this info:
 The rules for the template are same as the templates in 'query', 'url'
 etc
but I couldn't find any info about those either. Is this documented
somewhere?

-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 09.11.2009 06:54 Erik Hatcher wrote:

 The brackets probably come from it being transformed as an array. Try  
 saying multiValued="false" on your field specifications.

Indeed. Thanks Erik that was it.

My first steps with DIH showed me what a powerful tool it is, but
although the DIH wiki page might well be the longest in the whole wiki,
there are so many mysteries left for the uninitiated. Is there any other
documentation I might have missed?

Thanks
-Michael


Re: Getting started with DIH

2009-11-08 Thread Michael Lackhoff
On 09.11.2009 08:20 Noble Paul നോബിള്‍ नोब्ळ् wrote:

 It just started off as a single page and the features just got piled up
 and the page just got bigger. We are thinking of cutting it down to
 smaller, more manageable pages.

Oh, I like it the way it is as one page, so that the browser's full-text
search can help. It is just that the features and power seem to grow
even faster than the wiki page ;-)
E.g. I couldn't find out how to add a second RSS feed. I tried with a
second entity parallel to the slashdot one but got an exception:
java.io.IOException: FULL, whatever that means, so I must be doing
something wrong but couldn't find a hint.

-Michael


How to import multiple RSS-feeds with DIH

2009-11-08 Thread Michael Lackhoff
[A new thread for this particular problem]

On 09.11.2009 08:44 Noble Paul നോബിള്‍ नोब्ळ् wrote:

 The tried and tested strategy is to post the question in this mailing
 list w/ your data-config.xml.

See my data-config.xml below. The first entity is the usual slashdot
example with my 'id' addition, the second a very simple additional feed.
The second example works if I delete the slashdot feed, but as I said I
would like to have them both.

-Michael

<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <entity name="slashdot"
            pk="link"
            url="http://rss.slashdot.org/Slashdot/slashdot"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer,DateFormatTransformer">

      <field column="source"       xpath="/RDF/channel/title"
             commonField="true" />
      <field column="source-link"  xpath="/RDF/channel/link"
             commonField="true" />
      <field column="subject"      xpath="/RDF/channel/subject"
             commonField="true" />

      <field column="title"        xpath="/RDF/item/title" />
      <field column="link"         xpath="/RDF/item/link" />
      <field column="id"           template="${slashdot.link}" />
      <field column="description"  xpath="/RDF/item/description" />
      <field column="creator"      xpath="/RDF/item/creator" />
      <field column="item-subject" xpath="/RDF/item/subject" />

      <field column="slash-department" xpath="/RDF/item/department" />
      <field column="slash-section"    xpath="/RDF/item/section" />
      <field column="slash-comments"   xpath="/RDF/item/comments" />
      <field column="date"             xpath="/RDF/item/date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
    </entity>
    <entity name="heise"
            pk="link"
            url="http://www.heise.de/newsticker/heise.rdf"
            processor="XPathEntityProcessor"
            forEach="/RDF/channel | /RDF/item"
            transformer="TemplateTransformer">
      <field column="source"      xpath="/RDF/channel/title"
             commonField="true" />
      <field column="source-link" xpath="/RDF/channel/link"
             commonField="true" />

      <field column="title" xpath="/RDF/item/title" />
      <field column="link"  xpath="/RDF/item/link" />
      <field column="id"    template="${heise.link}" />
    </entity>
  </document>
</dataConfig>


Re: Preparing the ground for a real multilang index

2009-07-08 Thread Michael Lackhoff
On 08.07.2009 00:50 Jan Høydahl wrote:

 itself and do not need to know the query language. You may then want  
 to do a copyField from all your text_lang -> text for convenient one- 
 field-to-rule-them-all search.

Would that really help? As I understand it, copyField takes the raw, not
yet analyzed field value. I cannot yet see the advantage of this
text field over the current situation with no text_lang fields at all.
The copied-to text field has to be language-agnostic with no stemming at
all, so it would miss many hits. Or is there a way to combine many
differently stemmed variants into one field, to be able to search against
all of them at once? That would be great indeed!

-Michael


EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
In Germany we have a strange habit of seeing some sort of equivalence
between umlaut letters and a two-letter representation. Example: 'ä' and
'ae' are expected to give the same search results. To achieve this I
added this filter to the text fieldtype definition:
<filter class="solr.PatternReplaceFilterFactory"
        pattern="ä" replacement="ae" replace="all"
/>
to both index and query analyzers (and more for the other umlauts).
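(The same pattern repeated for the rest, for reference:
<filter class="solr.PatternReplaceFilterFactory" pattern="ö"
        replacement="oe" replace="all" />
<filter class="solr.PatternReplaceFilterFactory" pattern="ü"
        replacement="ue" replace="all" />
<filter class="solr.PatternReplaceFilterFactory" pattern="ß"
        replacement="ss" replace="all" />
plus the upper-case variants.)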

This works well when I search for a name (a word that is not stemmed) but
not e.g. with the word 'Wärme':
a search for 'wärme' works
a search for 'waerme' does not work
a search for 'waerm' works if I move the EnglishPorterFilterFactory after
the PatternReplaceFilterFactory.

DebugQuery for 'waerme' gives a parsedquery 'FS:waerm'.
What I don't understand is why the (existing) records are not found. If
I understand it right, there should be 'waerm' in the index as well.

By the way, the reason why I keep the EnglishPorterFilterFactory is that
the records are in many languages and the English stemming gives good
results in many cases, and I don't want (yet) to multiply my fields to
have language-specific versions.
But even if the stemming is not right because the language is not
English, I think records should be found as long as the analyzers are the
same for index and query.

This is with Solr 1.3.

Can someone shed some light on what is going on and how I can achieve my
goal?

-Michael


Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
On 02.07.2009 16:34 Walter Underwood wrote:

 First, don't use an English stemmer on German text. It will give some odd
 results.

I know but at the moment I only have the choice between no stemmer at
all and one stemmer and since more than half of the records are English
(about 60% English, 30% German, some Italian, French and others) the
results are not too bad.

 Are you using the same conversions on the index and query side?

Yes, index and query look exactly the same. That is what I don't
understand. I am not complaining about a misbehaving stemmer, unless it
does already something odd with the umlauts.

 The German stemmer might already handle typewriter umlauts. If it doesn't,
 use the pattern replace factory. You will also need to convert ß to ss.

That is what I tried. And yes I also have a filter for ß to ss. It
just doesn't work as expected.

 You really do need separate fields for each language.

Eventually. But now I have to get ready really soon with a small
application and people don't find what they expect.

 Handling these characters is language-specific. The typewriter umlaut
 conversion is wrong for English. It is correct, but rare, to see a diaeresis
 in English when vowels are pronounced separately, like coöperate. In
 Swedish, it is not OK to convert ö to another letter or combination
 of letters.

It is just for German users, and at the moment it would be totally OK to
have 'coöperate' indexed as 'cooeperate'. I know it is wrong and it will
be fixed, but given the tight schedule all I want at the moment is the
combination of some stemming (perhaps 70% right or more) and typewriter
umlauts (perhaps 90% correct; you gave examples for the missing 10%).

Do I have any chance?

-Michael



Re: EnglishPorterFilterFactory and PatternReplaceFilterFactory

2009-07-02 Thread Michael Lackhoff
On 02.07.2009 17:28 Erick Erickson wrote:

 I'm shooting a bit in the dark here, but I'd guess that these are
 actually understandable results.

Perhaps not too much in the dark

 That is, your implicit assumption, it seems to me, is that 'wärme' and
 'waerme' should go through the stemmer and
 become 'wärm' and 'waerm', and that you can then do the substitution
 on and produce the same output. I don't think that's a valid
 assumption.

Sounds very reasonable. Will see what I can make out of all this to keep
our librarians happy...

Yonik Seeley wrote:

 Also, check out MappingCharFilterFactory in Solr 1.4
 and mapping-ISOLatin1Accent.txt in example/solr/conf

Thanks for the hint, looking forward to the 1.4 release ;-) at the
moment we are on 1.3 though, I hope to upgrade soon but probably not
soon enough for this app.

-Michael


Preparing the ground for a real multilang index

2009-07-02 Thread Michael Lackhoff
As pointed out in the recent thread about stemmers and other language
specifics I should handle them all in their own right. But how?

The first problem is how to know the language. Sometimes I have a
language identifier within the record, sometimes I have more than one,
sometimes I have none. How should I handle the non-obvious cases?

Given I somehow know record1 is English and record2 is German, I then
need all my (relevant) fields for every language, e.g. I will have
TITLE_ENG and TITLE_GER and both will have their respective stemmer. But
what about exotic languages? Use a catch-all language without a stemmer?

Now a user searches for TITLE:term and I don't know beforehand the
language of term. Do I have to expand the query to something like
TITLE_ENG:term OR TITLE_GER:term OR TITLE_XY:term OR ... or is there
some sort of copyField for analyzed fields? Then I could just copy all
the TITLE_* fields to TITLE and not bother with the language of the query.

Are there any solutions that prevent an index with thousands of fields
and dozens of ORed query terms?

I know I will have to implement some better multilanguage support but
would also like to keep it as simple as possible.

-Michael


Re: Preparing the ground for a real multilang index

2009-07-02 Thread Michael Lackhoff
On 03.07.2009 00:49 Paul Libbrecht wrote:

[I'll try to address the other responses as well]

 I believe the proper way is for the server to compute a list of  
 accepted languages in order of preferences.
 The web-platform language (e.g. the user-setting), and the values in  
 the Accept-Language http header (which are from the browser or  
 platform).

All this is not going to help much because the main application is a
scientific search portal for books and articles with many users
searching cross-language. The most typical use case is a German user
searching multilingual. So we might even get the search multilingual,
e.g. TITLE:cancer OR TITLE:krebs. No way here to watch out for
Accept-headers or a language select field (would be left on any in
most cases). Other popular use cases are citations (in whatever
language) cut and pasted into the search field.

 Then you expand your query for "surfing waves" (say) to:
 - phrase query: "surfing waves" exactly (^2.0)
 - two terms, no stemming: surfing waves (^1.5)
 - iterate through the languages and query for stemmed variants:
- english: surf wav ^1.0
- german: surfing wave ^0.9
- ...
 - then maybe even try the phonetic analyzer (matched in a separate  
 field probably)

This is an even more sophisticated variant of the multiple OR I came
up with. Oh well...

 I think this is a common pattern on the web where the users, browsers,  
 and servers are all somewhat multilingual.

indeed, and often users are not even aware of it; especially in a
scientific context they use their native tongue and English almost
interchangeably -- and they expect the search engine to cope with it.

I think the best would be to process the data according to its language
but not make any assumptions about the query language -- and I am totally
lost as to how to get a clever schema.xml out of all this.

Thanks everyone for listening and I am still open for good suggestions
to deal with this problem!

-Michael


Re: Moving from single core to multicore

2009-02-10 Thread Michael Lackhoff
On 10.02.2009 02:39 Chris Hostetter wrote:

 : Now all that is left is a more cosmetic change I would like to make:
 : I tried to place the solr.xml in the example dir to get rid of the
 : -Dsolr.solr.home=multicore for the start and changed the first entry
 : from core0 to solr and moved the core1 dir from multicore directly
 : under the example dir
 : Idea behind all this: Use the original single core under solr as core0
 : and add a second one on the same directory level (core1 parallel to
 : solr). Then I started solr with the old java -jar start.jar in the
 : example dir. But the multicore config seems to be ignored then, I get
 
 solr looks for conf/solr.xml relative to the Solr Home Dir and if it 
 doesn't find it then it looks for conf/solrconfig.xml ... if you don't set 
 the solr.solr.home system property then the Solr Home Dir defaults to 
 ./solr/
 
 so putting your new solr.xml file in example/solr/conf should be what you 
 are looking for.

Almost. I had to change solr.xml like this, otherwise everything was
expected under ./solr, looking for solr/solr and solr/core1:

  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir=".">
      <property name="dataDir" value="./data" />
    </core>
    <core name="core1" instanceDir="../core1">
      <property name="dataDir" value="../core1/data" />
    </core>
  </cores>

Though the dataDir property seems to be ignored, I had to set it in
solrconfig.xml of both cores.

Thanks for all your help, the support all of you are giving is really
outstanding!
--Michael



Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
Hello,

I am not that experienced but managed to get a Solr index going by
copying the example dir from the distribution (1.3 released version)
and changing the fields in schema.xml to my needs. As I said everything
is working very well so far.
Now I need a second index on the same machine and the natural solution
seems to be multicore (I would really like to keep the two distinct so I
didn't put everything in one index).
But I have some problems setting this up. As long as I try the multicore
sample everything works but when I copy my schema.xml into the
multicore/core0/conf dir I only get 404 error messages when I enter the
admin url.
Looks like I cannot just copy over a single-core config to a multicore
environment, and that is OK; what I am missing is some guidance on what
to look out for. What are the settings that have to be adjusted for
multicore? I would like to avoid trial and error for every single
setting I have in my config.

And a related question: I would like to keep the existing data dir as
core0-datadir (/path_to_installation/example/solr/data). Is this
possible with the dataDir parameter? And if yes, what would be the
correct value? /solr/data/ or
/path_to_installation/example/solr/data/? Do I need an absolute path
or is it relative to the dir where my start.jar is?

Thanks,
Michael


Re: Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
On 09.02.2009 15:40 Ryan McKinley wrote:

 But I have some problems setting this up. As long as I try the  
 multicore
 sample everything works but when I copy my schema.xml into the
 multicore/core0/conf dir I only get 404 error messages when I enter  
 the
 admin url.
 
 what is the url you are hitting?
those from the wiki: http://localhost:8983/solr/core0/select?q=*:*
 Do you see links from the index page?
Sorry, I don't know what you mean by this

 Are there any messages in the log files?

This looks like the key. The output is a bit difficult to follow but I
found the most likely reason: the txt files were missing (stopwords.txt,
synonyms.txt ...) and then the fieldtype definitions failed. After I
copied the complete conf dir over to multicore it is almost working now.

Only problems: I get this warning:
2009-02-09 16:27:31.177::WARN:  /solr/admin/
java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:571)
[lots more]

and both cores seem to reference the old single core data. If I do a
search both give (the same) results (from the old core), I expected them
to be empty, searching in a newly created index somewhere below the
multicore dir.

I couldn't find a dataDir definition, so I still don't know how to add a
real second core (not just two cores with the same data).

Any ideas?

Thanks so far
Michael


Re: Moving from single core to multicore

2009-02-09 Thread Michael Lackhoff
On 09.02.2009 17:01 Ryan McKinley wrote:

 Check your solrconfig.xml; you probably have something like this:
 
   <!-- Used to specify an alternate directory to hold all index data
        other than the default ./data under the Solr home.
        If replication is in use, this should match the replication
        configuration. -->
   <dataDir>${solr.data.dir:./solr/data}</dataDir>
 (from the example)
 
 either remove that or make each one point to the correct location

Thanks, that's it!

Now all that is left is a more cosmetic change I would like to make:
I tried to place the solr.xml in the example dir to get rid of the
-Dsolr.solr.home=multicore for the start, changed the first entry
from core0 to solr, and moved the core1 dir from multicore directly
under the example dir.
Idea behind all this: use the original single core under solr as core0
and add a second one on the same directory level (core1 parallel to
solr). Then I started Solr with the old 'java -jar start.jar' in the
example dir. But the multicore config seems to be ignored then; I get
my old single core, e.g. http://localhost:8983/solr/core1/select?q=*:* is
no longer found.
As I said, everything works if I leave it in the multicore subdir and
start with -Dsolr.solr.home=multicore, but it would be nice if I could
do without that extra subdir and the extra start parameter.

--Michael



Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 31.10.2008 19:16 Chris Hostetter wrote:

 for the record, you don't need to index as a StrField to get this 
 benefit, you can still index using DateField, you just need to round your 
 dates to some less granular level .. if you always want to round down, you 
 don't even need to do the rounding yourself, just add /SECOND 
 or /MINUTE or /HOUR to each of your dates before sending them to solr.  
 (SOLR-741 proposes adding a config option to DateField to let this be done 
 server side)

Is this also possible for the timestamp that is automatically added to
all new/updated docs? I would like to be able to search (quickly) for
everything that was added within the last week or month or whatever. And
because I update the index only once a day, a granularity of /DAY (if
that exists) would be fine.

- Michael


Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 01.11.2008 06:10 Erik Hatcher wrote:

 Yeah, this should work fine:
 
 <field name="timestamp" type="date" indexed="true" stored="true"  
 default="NOW/DAY" multiValued="false"/>

Wow, that was fast, thanks!
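(With that in place, an everything-from-the-last-week query should --
if I read the DateMathParser docs right -- be as simple as
timestamp:[NOW/DAY-7DAYS TO *]
and NOW/DAY-1MONTH etc. for longer ranges.)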

-Michael


Re: Searching for future or null dates

2008-09-26 Thread Michael Lackhoff
On 26.09.2008 06:17 Chris Hostetter wrote:

 that's true; regrettably there is no prefix operator to indicate a SHOULD 
 clause in the Lucene query language, so if you set the default op to AND 
 you can't then override it on individual clauses.
 
 this is one of the reasons i never make the default op AND.

Just for symmetry or to get rid of this restriction wouldn't it be a
good idea to add such a prefix operator?

 i'm sure your food will still taste pretty good :)

That's what my wife keeps telling me ;-)

Many thanks. I think I will leave it as is for the current application
but use OR-Default plus prefix operators for new projects.

-Michael



Re: Searching for future or null dates

2008-09-23 Thread Michael Lackhoff
On 23.09.2008 00:30 Chris Hostetter wrote:

 : Here is what I was able to get working with your help.
 : 
 : (productId:(102685804)) AND liveDate:[* TO NOW] AND ((endDate:[NOW TO *]) OR
 : ((*:* -endDate:[* TO *])))
 : 
 : the *:* is what I was missing.
 
 Please, PLEASE ... do yourself a favor and stop using AND and OR ...  
 food will taste better, flowers will smell fresher, and the world will be 
 a happy shiny place...
 
 +productId:102685804 +liveDate:[* TO NOW] +(endDate:[NOW TO *] (*:* 
 -endDate:[* TO *]))

I would also like to follow your advice but don't know how to do it with
defaultOperator=AND. What I am missing is the equivalent to OR:
AND: +
NOT: -
OR: ???
I didn't find anything on the Solr or Lucene query syntax pages. If
there is such an equivalent then I guess the query would become:
productId:102685804 liveDate:[* TO NOW] (endDate:[NOW TO *] OR (*:*
-endDate:[* TO *]))

I switched to the AND-default because that is the default in my web
frontend so I don't have to change logic. What should I do in this
situation? Go back to the OR-default?

It is not so much this example I am after, but I have a syntax translator
in my application that must be able to handle similar expressions, and I
want to keep it simple and still have tasty food ;-)

-Michael


Re: wildcard newbie question

2008-01-30 Thread Michael Lackhoff

On 31.01.2008 00:31 Alessandro Senserini wrote:

I have a text field type called courseTitle and it contains 


Struts 2

If I search courseTitle:strut*  I get the documents but if I search with
courseTitle:struts* I do not get any results.

Could you please explain why?


Just a guess: It might be because of stemming. Do you have the same 
effect with words that don't end in an 's' or similar?

If my guess is correct, only 'strut' is in the index, not 'struts'.

-Michael



Out of heap space with simple updates

2008-01-23 Thread Michael Lackhoff
I wanted to try to do the daily update with XML updates (this was mentioned 
recently as the recommended way) but got an OutOfMemoryError: Java heap 
space after 319000 records.
I am sending one document at a time through the http update interface, 
so every request should be short enough not to run out of memory.
Do I have to commit after every few thousand records to avoid the error? 
My understanding was that I only have to do a commit at the very end. Or 
are there other things I could try?
How can I increase the heap size? I use the included Jetty and start 
Solr with 'java -jar start.jar'.

After I ran into the error a commit wasn't possible either.

What is the best way to avoid this sort of problems?

Thanks
-Michael



Re: Out of heap space with simple updates

2008-01-23 Thread Michael Lackhoff

On 23.01.2008 20:57 Chris Harris wrote:


I'm using

java -Xms512M -Xmx1500M -jar start.jar



Thanks! I did see the -X... params in recent threads but didn't know 
where to place them -- not being a java guy at all ;-)


-Michael



Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff
After a long weekend I could take a deeper look into this one, and it looks 
as if the problem has to do with splitting.



This one works for me fine.

$ cat t2.csv
id,name
12345,'s-Gravenhage
12345,'s-Gravenhage
12345,s-Gravenhage

$ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
@t2.csv -H 'Content-type:text/csv; charset=utf-8'


My csv-file:
DBRECORDID,PUBLPLACE
43298,'s-Gravenhage

The URL (giving a 400 error):
http://localhost:8983/solr/update/csv?f.PUBLPLACE.split=true&commit=true
(PUBLPLACE is defined as multivalued field)

If I remove the f.PUBLPLACE.split=true parameter OR make sure that the 
apostrophe is not the first character, everything is fine.
But I need the field to be multivalued and thus need the split parameter 
(not for this record but for others) and as the example shows, some have 
an apostrophe as the first character. Any ideas how to deal with this?


-Michael


Re: Another text I cannot get into SOLR with csv

2008-01-08 Thread Michael Lackhoff

On 08.01.2008 16:11 Yonik Seeley wrote:


Ahh, wait, it looks like a single quote is the encapsulator for split field
values by default.
Try adding f.PUBLPLACE.encapsulator=%00
to disable the encapsulation.


Hmm. Yes, this works but:
- I didn't find anything about it in the docs (wiki). On the contrary
  it suggests that the single quote has to be explicitly set:
  f.tags.encapsulator='

(http://wiki.apache.org/solr/UpdateCSV?#head-c238cb494f800d345766acda16e08d82663127ce)
- A literal encapsulator should be possible to add by doubling
  it (' -> '') but this gives the same error
- is it possible to change the split field separator for all fields? The
  URL is getting rather long already.



Re: correct escapes in csv-Update files

2008-01-04 Thread Michael Lackhoff
On 03.01.2008 17:16 Yonik Seeley wrote:

 CSV doesn't use backslash escaping.
 http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm
 
 "This is text with a ""quoted"" string"

Thanks for the hint but the result is the same, that is, ""quoted""
behaves exactly like \"quoted\":
- both leave the single unescaped quote in the record: "quoted"
- both have the problem with a backslash before the escaped quote:
  "This is text with a \""quoted"" string" gives an error "invalid
  char between encapsulated token end delimiter".

So, is it possible to get a record into the index with csv that
originally looks like this?:
This is text with an unusual \"combination" of characters

A single quote is no problem: just double it (" -> "").
A single backslash is no problem: just leave it alone (\ -> \).
But what about a backslash followed by a quote (\" -> ???)

-Michael



Re: Another text I cannot get into SOLR with csv

2008-01-04 Thread Michael Lackhoff
On 04.01.2008 16:55 Yonik Seeley wrote:

 On Jan 4, 2008 10:25 AM, Michael Lackhoff [EMAIL PROTECTED] wrote:
 If the field's value is:
 's-Gravenhage
 I cannot get it into SOLR with CSV.
 
 This one works for me fine.
 
 $ cat t2.csv
 id,name
 12345,'s-Gravenhage
 12345,'s-Gravenhage
 12345,s-Gravenhage
 
 $ curl http://localhost:8983/solr/update/csv?commit=true --data-binary
 @t2.csv -H 'Content-type:text/csv; charset=utf-8'

But you are cheating ;-) This works for me too, but I am using a local
csv file for the update:
http://localhost:8983/solr/update/csv?stream.file=t2.csv&separator=%09&f.SIGNATURE.split=true&commit=true

Perhaps the problem is that I cannot define a charset for the stream.file?

-Michael