Re: update a document without changing anything

2016-11-26 Thread Ishan Chattopadhyaya
Maybe do an "inc" of 0 to a numeric field for every document.
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
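
A minimal SolrJ sketch of that trick (the collection URL, document id, and
numeric field name below are placeholders, not from your setup):

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TouchDoc {
        public static void main(String[] args) throws Exception {
            // Placeholder base URL and collection name -- adjust to your setup.
            SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc1"); // unique key of the document to "touch"
            // Atomic update: increment a numeric field by 0. Solr rewrites the
            // document from its stored fields without any visible change, and the
            // update is applied on the leader, avoiding the get+insert race.
            doc.addField("some_numeric_field", Collections.singletonMap("inc", 0));
            client.add(doc);
            client.commit();
            client.close();
        }
    }

Note this relies on atomic updates, so it needs all fields stored (which you
say they are).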

On Wed, Nov 23, 2016 at 2:13 PM, Dorian Hoxha 
wrote:

> Hello searcherers,
>
> So, I have documents that are fully stored. Then I make a small change in
> the schema. And now I have to reinsert every document. But I'm afraid of
> doing a get+insert, because something else may change the document in the
> meantime. So I want to do an "update" of nothing, so that internally on the
> master shard the document is updated without changes. Maybe an update with
> no modifiers?
>
> Thank You!
>


Re: stream, features and train

2016-11-26 Thread Joel Bernstein
Hi,

It looks like the outcome field may not be correct, or it may have missing
values. You'll need to populate this field for all records in the training
set.
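
A quick way to spot the offending records, assuming your outcome field is
named Out as in your expression, is a streaming search for documents that
are missing it:

    search(UNCLASS,
           q="-Out:[* TO *]",
           fl="id",
           sort="id asc",
           rows=100)

Any documents this returns need the outcome field populated (e.g. 1 for
positive and -1 for negative training examples) before features() will work
over the whole training set.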

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Nov 23, 2016 at 3:21 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi - I'm trying to experiment with the new train, features, model,
> classify capabilities of Solr 6.3.0.  I'm following along on:
> https://cwiki.apache.org/confluence/display/solr/Streaming+
> Expressions#StreamingExpressions-StreamSources
>
> When I execute:
> features(UNCLASS,
> q="*:*",
> featureSet="JoeFeature1",
> field="Title",
> outcome="Out",
> numTerms=250)
>
> Title is defined like:
> 
>
> Is this the correct syntax?  I'm getting an error:
>
> {
>   "result-set": {
> "docs": [
>   {
> "EXCEPTION": "java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error from server at http://cressida:9100/solr/UNCLASS_shard2_replica2:
> java.lang.NullPointerException\n\tat org.apache.solr.search.IGainTe
> rmsQParserPlugin$IGainTermsCollector.collect(IGainTermsQParserPlugin.java:129)\n\tat
> org.apache.lucene.search.MatchAllDocsQuery$1$1.score(MatchAllDocsQuery.java:56)\n\tat
> org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)\n\tat
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:669)\n\tat
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)\n\tat
> org.apache.solr.search.SolrIndexSearcher.buildAndRunCollecto
> rChain(SolrIndexSearcher.java:242)\n\tat org.apache.solr.search.SolrInd
> exSearcher.getDocListNC(SolrIndexSearcher.java:1803)\n\tat
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1620)\n\tat
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:617)\n\tat
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:531)\n\tat
> org.apache.solr.handler.component.SearchHandler.handleReques
> tBody(SearchHandler.java:295)\n\tat org.apache.solr.handler.Reques
> tHandlerBase.handleRequest(RequestHandlerBase.java:153)\n\tat
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)\n\tat
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilte
> r(ServletHandler.java:1668)\n\tat org.eclipse.jetty.servlet.Serv
> letHandler.doHandle(ServletHandler.java:581)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(
> ContextHandler.java:1160)\n\tat org.eclipse.jetty.servlet.Serv
> letHandler.doScope(ServletHandler.java:511)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(
> ContextHandler.java:1092)\n\tat org.eclipse.jetty.server.handl
> er.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.ha
> ndle(ContextHandlerCollection.java:213)\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(
> HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handl
> er.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.
> succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io
> .FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io
> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> .produceAndRun(ExecuteProduceConsume.java:246)\n\tat
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume
> .run(ExecuteProduceConsume.java:156)\n\tat org.eclipse.jetty.util.thread.
> QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat
> java.lang.Thread.run(Thread.java:745)\n",
> "EOF": true,
> "RESPONSE_TIME": 10
>   }
> ]
>   }
> }
>
> Thank you!
>
> -Joe
>
>


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík
I ran my jar application beside the running Solr instance where I want to
trigger a DIH import.
I tried this approach:

String urlString1 = "http://localhost:8983/solr/db/dataimport;;
SolrClient solr1 = new HttpSolrClient.Builder(urlString1).build();
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("command", "full-import");
SolrRequest request = new QueryRequest(params);
solr1.request(request);

.. and it returns now:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://localhost:8983/solr/db/dataimport: Expected mime type
application/octet-stream but got text/html. 


Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/db/dataimport/select. Reason:
Not Found



So I am still confused now ...

What do you think ? Any ideas?

I am trying to figure it out. The silly thing is, when I create a plain URL
from the same string used in those Solr request objects and fire it off in
Java, it does the desired thing.

Weird, I think.

Thanks for any replies or help.


2016-11-26 20:03 GMT+01:00 Marek Ščevlík :

> Actually, to be honest, I realized that I only needed to trigger the data
> import handler from a jar file. Previously this was done in earlier
> versions via the SolrServer object. Now I am wondering if this is OK:
>
> String urlString1 = "http://localhost:8983/solr/;;
> SolrClient solr1 = new HttpSolrClient.Builder(urlString).build();
>   
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("db","/dataimport");
> params.set("command", "full-import");
> System.out.println(params.toString());
> QueryResponse qresponse1 = solr1.query(params);
>
> System.out.println("response = " + qresponse1);
>
> The output I get from this is: response = {responseHeader={status=0,
> QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},
> response={numFound=0,start=0,docs=[]}}
>
> There is a core named db which comes with the examples in the Solr 6.3
> package. It is loaded. From the admin web UI I can operate it and run the
> DIH reindex process.
>
> I wonder whether this could work? What do you think? I am trying to call
> DIH whilst Solr is running. This code is in a separate jar file that is run
> beside the Solr instance.
>
> This so far is not working for me, and I wonder why. What do you think?
> Should this work at all? Or perhaps someone else could help out.
>
>
> Thanks anyone for any help.
> 
>
> 2016-11-25 19:50 GMT+01:00 Marek Ščevlík :
>
>> I forgot to mention I am creating a jar file beside a running Solr 6.3
>> instance, to which I am hoping to attach with Java via the
>> SolrDispatchFilter to get at the cores, so that I could then work with
>> data in code.
>>
>>
>> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>>
>>> Hi Daniel. Thanks for the reply. I wonder whether it is still possible,
>>> with the release of Solr 6.3, to get hold of the running instance of the
>>> Jetty server that is part of the solution? I found some code for previous
>>> versions where it was captured like this, and one could then obtain the
>>> cores of a running Solr instance ...
>>>
>>> SolrDispatchFilter solrDispatchFilter =
>>>     (SolrDispatchFilter) jetty.getDispatchFilter().getFilter();
>>>
>>>
>>> I was trying to implement it this way but that is not working out very
>>> well now. I can't seem to get the Jetty server object for the running
>>> instance. I tried several combinations but none seemed to work.
>>>
>>> Can you perhaps point me in the right direction?
>>>
>>> Perhaps you may know more than I do at the moment.
>>>
>>>
>>> Any help would be great.
>>>
>>>
>>> Thanks a lot
>>> Regards Marek Scevlik
>>>
>>>
>>>
>>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>>> daniel.da...@nih.gov>:
>>>
 Marek,

 I've wanted to do something like this in the past as well.  However, a
 rewrite that supports the same XML syntax might be better.   There are
 several problems with the design of the Data Import Handler that make it
 not quite suitable:

 - Not designed for Multi-threading
 - Bad implementation of XPath

 Another issue is that one of the big advantages of Data Import Handler
 goes away at this point, which is that it is hosted within Solr, and has a
 UI for testing within the Solr admin.

 A better open-source Java solution might be to connect Solr with Apache
 Camel - http://camel.apache.org/solr.html.

 If you are not tied absolutely to pure open-source, and freemium
 products will do, then you might look at Pentaho Spoon and Kettle.
  Although Talend is much more established in the market, I find Pentaho's
 XML-based ETL a bit easier to integrate as a developer, and unit test and
 such.   Talend does better when you have a full infrastructure set up, but
 then the attention required to unit tests and Git integration seems over
 the top.

Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Erick Erickson
on a quick glance, and not having tried this myself...

this seems wrong. You're setting a URL parameter "db":
params.set("db","/dataimport");

that's equivalent to a URL like
http://localhost:8983/solr?db=/dataimport

you'd want:
http://localhost:8983/solr/db/dataimport?command=full-import

I think you want to set the URL for your HttpClient to
the full Solr path to the dataimport handler, i.e. something like
...solr/collection_or_core/dataimport
then set the params for dataimport handler like you are, i.e.:
params.set("command", "full-import");

Best,
Erick

On Sat, Nov 26, 2016 at 11:03 AM, Marek Ščevlík
 wrote:
> Actually, to be honest, I realized that I only needed to trigger the data
> import handler from a jar file. Previously this was done in earlier
> versions via the SolrServer object. Now I am wondering if this is OK:
>
> String urlString1 = "http://localhost:8983/solr/;;
> SolrClient solr1 = new HttpSolrClient.Builder(urlString).build();
>
> ModifiableSolrParams params = new ModifiableSolrParams();
> params.set("db","/dataimport");
> params.set("command", "full-import");
> System.out.println(params.toString());
> QueryResponse qresponse1 = solr1.query(params);
>
> System.out.println("response = " + qresponse1);
>
> The output I get from this is: response =
> {responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}
>
> There is a core named db which comes with the examples in the Solr 6.3
> package. It is loaded. From the admin web UI I can operate it and run the
> DIH reindex process.
>
> I wonder whether this could work? What do you think? I am trying to call
> DIH whilst Solr is running. This code is in a separate jar file that is run
> beside the Solr instance.
>
> This so far is not working for me, and I wonder why. What do you think?
> Should this work at all? Or perhaps someone else could help out.
>
>
> Thanks anyone for any help.
> 
>
> 2016-11-25 19:50 GMT+01:00 Marek Ščevlík :
>
>> I forgot to mention I am creating a jar file beside a running Solr 6.3
>> instance, to which I am hoping to attach with Java via the
>> SolrDispatchFilter to get at the cores, so that I could then work with
>> data in code.
>>
>>
>> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>>
>>> Hi Daniel. Thanks for the reply. I wonder whether it is still possible,
>>> with the release of Solr 6.3, to get hold of the running instance of the
>>> Jetty server that is part of the solution? I found some code for previous
>>> versions where it was captured like this, and one could then obtain the
>>> cores of a running Solr instance ...
>>>
>>> SolrDispatchFilter solrDispatchFilter =
>>>     (SolrDispatchFilter) jetty.getDispatchFilter().getFilter();
>>>
>>>
>>> I was trying to implement it this way but that is not working out very
>>> well now. I can't seem to get the Jetty server object for the running
>>> instance. I tried several combinations but none seemed to work.
>>>
>>> Can you perhaps point me in the right direction?
>>>
>>> Perhaps you may know more than I do at the moment.
>>>
>>>
>>> Any help would be great.
>>>
>>>
>>> Thanks a lot
>>> Regards Marek Scevlik
>>>
>>>
>>>
>>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>>> daniel.da...@nih.gov>:
>>>
 Marek,

 I've wanted to do something like this in the past as well.  However, a
 rewrite that supports the same XML syntax might be better.   There are
 several problems with the design of the Data Import Handler that make it
 not quite suitable:

 - Not designed for Multi-threading
 - Bad implementation of XPath

 Another issue is that one of the big advantages of Data Import Handler
 goes away at this point, which is that it is hosted within Solr, and has a
 UI for testing within the Solr admin.

 A better open-source Java solution might be to connect Solr with Apache
 Camel - http://camel.apache.org/solr.html.

 If you are not tied absolutely to pure open-source, and freemium
 products will do, then you might look at Pentaho Spoon and Kettle.
  Although Talend is much more established in the market, I find Pentaho's
 XML-based ETL a bit easier to integrate as a developer, and unit test and
 such.   Talend does better when you have a full infrastructure set up, but
 then the attention required to unit tests and Git integration seems over
 the top.

 Another powerful way to get things done, depending on what you are
 indexing, is to use LogStash and couple that with Document processing
 chains.   Many of our projects benefit from having a single RDBMS view,
 perhaps a materialized view, that is used for the index.   LogStash does
 just fine here, pulling from the RDBMS and posting each row to Solr.  The
 hierarchical execution of Data Import Handler is very nice, but this can
 often be handled on the RDBMS side by creating a view, maybe using
 functions to provide some rows.

Re: Metadata and Newline Characters at Content

2016-11-26 Thread Furkan KAMACI
PS: \n characters are not shown in the browser, but they break how the
highlighter works. \n characters are counted toward fragsize too.

On Sat, Nov 26, 2016 at 9:47 PM, Furkan KAMACI 
wrote:

> Hi Erick,
>
> I resolved my metadata problem by configuring solrconfig.xml. However,
> even when I post data with post.sh I see content like:
>
> CANADA �1 \n  \n \n   \n Place
>
> I have newline characters as \n and some non-ASCII characters. As far as I
> understand it is usual to have such characters, because that is a PDF file
> and its newline characters are interpreted as *\n* by Solr. How can I
> remove them (\n and the non-ASCII characters)?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson 
> wrote:
>
>> Not sure. What have you tried?
>>
>>  For production situations or when you want to take total control of
>> the indexing process, I strongly recommend that you put the Tika
>> parsing on the _client_.
>>
>> Here's a writeup on this topic:
>>
>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI 
>> wrote:
>> > Hi Erick,
>> >
>> > When I check the *Solr* documentation I see that [1]:
>> >
>> > *In addition to Tika's metadata, Solr adds the following metadata
>> (defined
>> > in ExtractingMetadataConstants):*
>> >
>> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
>> > Depending on how the file is uploaded, this may or may not be set.*
>> > *"stream_source_info" - Any source info about the stream. See
>> > ContentStream.*
>> > *"stream_size" - The size of the stream in bytes(?)*
>> > *"stream_content_type" - The content type of the stream, if available.*
>> >
>> > So, it seems that these may not be added by Tika, but Solr. Do you know
>> how
>> > to enable/disable this feature?
>> >
>> > Kind Regards,
>> > Furkan KAMACI
>> >
>> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
>> >
>> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
>> >> data you see when you return stored data is _before_ any analysis so
>> >> the PatternFactory won't be applied. You could do this in a
>> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
>> >> the real app deal with it.
>> >>
>> >> I don't particularly know about the Tika settings, that's largely a
>> guess.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI > >
>> >> wrote:
>> >> > Hi Erick,
>> >> >
>> >> > 1) I am looking at stored data via the Solr Admin UI. I send the
>> >> > query and check what is in the content field.
>> >> >
>> >> > 2) I can debug the Tika settings if you think that it is not the
>> >> > desired behaviour to have such metadata fields combined into the
>> >> > content field.
>> >> >
>> >> > *PS: *Is there any solution to get rid of it except for
>> >> > using PatternCaptureGroupFilterFactory?
>> >> >
>> >> > Kind Regards,
>> >> > Furkan KAMACI
>> >> >
>> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <
>> erickerick...@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> 1> I'm assuming when you "see" this data you're looking at the
>> stored
>> >> >> data, right? It's a verbatim copy of whatever you sent to the field.
>> >> >> I'm guessing it's a character-encoding mismatch between the source
>> and
>> >> >> what you use to display.
>> >> >>
>> >> >> 2> How are you extracting this data? There are Tika options I think
>> >> >> that can/do mush fields together.
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <
>> furkankam...@gmail.com>
>> >> >> wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > I'm testing Solr 4.9.1 I've indexed documents via it. Content
>> field at
>> >> >> > schema has text_general field type which is not modified from
>> >> original. I
>> >> >> > do not copy any fields to content. When I check the data  I see
>> >> content
>> >> >> > values as like:
>> >> >> >
>> >> >> >  " \n \nstream_source_info MARLON BRANDO.rtf
>>  \nstream_content_type
>> >> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON
>> >> BRANDO.rtf
>> >> >> > \nContent-Type application/rtf   \nresourceName MARLON
>> BRANDO.rtf   \n
>> >> >> \n
>> >> >> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named
>> Desire\"
>> >> >> > directed by Elia Kazan \n"
>> >> >> >
>> >> >> > My questions:
>> >> >> >
>> >> >> > 1) Is it usual to have that newline characters?
>> >> >> > 2) Is it usual to have file metadata at the beginning of the
>> content
>> >> >> (i.e.
>> >> >> > stream source, stream_content_type) or related to tool that I post
>> >> data
>> >> >> to
>> >> >> > Solr?
>> >> >> >
>> >> >> > Kind Regards,
>> >> >> > Furkan KAMACI
>> >> >>
>> >>
>>
>
>


Re: Metadata and Newline Characters at Content

2016-11-26 Thread Furkan KAMACI
Hi Erick,

I resolved my metadata problem by configuring solrconfig.xml. However, even
when I post data with post.sh I see content like:

CANADA �1 \n  \n \n   \n Place

I have newline characters as \n and some non-ASCII characters. As far as I
understand it is usual to have such characters, because that is a PDF file
and its newline characters are interpreted as *\n* by Solr. How can I
remove them (\n and the non-ASCII characters)?
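
Would an update processor chain be a reasonable way to strip these at index
time? A sketch of what I have in mind (assuming the extracted text lands in a
field named content, and that solr.RegexReplaceProcessorFactory is available
in my version):

    <updateRequestProcessorChain name="strip-control-chars">
      <processor class="solr.RegexReplaceProcessorFactory">
        <!-- "content" is my field name; adjust to the actual schema -->
        <str name="fieldName">content</str>
        <!-- collapse newlines and other non-printable-ASCII runs to a space -->
        <str name="pattern">[^\x20-\x7E]+</str>
        <str name="replacement"> </str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>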

Kind Regards,
Furkan KAMACI

On Thu, Nov 24, 2016 at 8:58 PM, Erick Erickson 
wrote:

> Not sure. What have you tried?
>
>  For production situations or when you want to take total control of
> the indexing process, I strongly recommend that you put the Tika
> parsing on the _client_.
>
> Here's a writeup on this topic:
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Nov 24, 2016 at 10:37 AM, Furkan KAMACI 
> wrote:
> > Hi Erick,
> >
> > When I check the *Solr* documentation I see that [1]:
> >
> > *In addition to Tika's metadata, Solr adds the following metadata
> (defined
> > in ExtractingMetadataConstants):*
> >
> > *"stream_name" - The name of the ContentStream as uploaded to Solr.
> > Depending on how the file is uploaded, this may or may not be set.*
> > *"stream_source_info" - Any source info about the stream. See
> > ContentStream.*
> > *"stream_size" - The size of the stream in bytes(?)*
> > *"stream_content_type" - The content type of the stream, if available.*
> >
> > So, it seems that these may not be added by Tika, but Solr. Do you know
> how
> > to enable/disable this feature?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > [1] https://wiki.apache.org/solr/ExtractingRequestHandler
> >
> > On Thu, Nov 24, 2016 at 6:51 PM, Erick Erickson  >
> > wrote:
> >
> >> about PatternCaptureGroupFilterFactory. This isn't going to help. The
> >> data you see when you return stored data is _before_ any analysis so
> >> the PatternFactory won't be applied. You could do this in a
> >> ScriptUpdateProcessorFactory. Or, just don't worry about it and have
> >> the real app deal with it.
> >>
> >> I don't particularly know about the Tika settings, that's largely a
> guess.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Nov 24, 2016 at 8:43 AM, Furkan KAMACI 
> >> wrote:
> >> > Hi Erick,
> >> >
> >> > 1) I am looking at stored data via the Solr Admin UI. I send the
> >> > query and check what is in the content field.
> >> >
> >> > 2) I can debug the Tika settings if you think that it is not the
> >> > desired behaviour to have such metadata fields combined into the
> >> > content field.
> >> >
> >> > *PS: *Is there any solution to get rid of it except for
> >> > using PatternCaptureGroupFilterFactory?
> >> >
> >> > Kind Regards,
> >> > Furkan KAMACI
> >> >
> >> > On Thu, Nov 24, 2016 at 6:31 PM, Erick Erickson <
> erickerick...@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> 1> I'm assuming when you "see" this data you're looking at the stored
> >> >> data, right? It's a verbatim copy of whatever you sent to the field.
> >> >> I'm guessing it's a character-encoding mismatch between the source
> and
> >> >> what you use to display.
> >> >>
> >> >> 2> How are you extracting this data? There are Tika options I think
> >> >> that can/do mush fields together.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >>
> >> >>
> >> >> On Thu, Nov 24, 2016 at 7:54 AM, Furkan KAMACI <
> furkankam...@gmail.com>
> >> >> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I'm testing Solr 4.9.1 I've indexed documents via it. Content
> field at
> >> >> > schema has text_general field type which is not modified from
> >> original. I
> >> >> > do not copy any fields to content. When I check the data  I see
> >> content
> >> >> > values as like:
> >> >> >
> >> >> >  " \n \nstream_source_info MARLON BRANDO.rtf
>  \nstream_content_type
> >> >> > application/rtf   \nstream_size 13580   \nstream_name MARLON
> >> BRANDO.rtf
> >> >> > \nContent-Type application/rtf   \nresourceName MARLON BRANDO.rtf
>  \n
> >> >> \n
> >> >> > \n  1. Vivien Leigh and Marlon Brando in \"A Streetcar Named
> Desire\"
> >> >> > directed by Elia Kazan \n"
> >> >> >
> >> >> > My questions:
> >> >> >
> >> >> > 1) Is it usual to have that newline characters?
> >> >> > 2) Is it usual to have file metadata at the beginning of the
> content
> >> >> (i.e.
> >> >> > stream source, stream_content_type) or related to tool that I post
> >> data
> >> >> to
> >> >> > Solr?
> >> >> >
> >> >> > Kind Regards,
> >> >> > Furkan KAMACI
> >> >>
> >>
>


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-26 Thread Marek Ščevlík
Actually, to be honest, I realized that I only needed to trigger the data
import handler from a jar file. Previously this was done in earlier
versions via the SolrServer object. Now I am wondering if this is OK:

String urlString1 = "http://localhost:8983/solr/;;
SolrClient solr1 = new HttpSolrClient.Builder(urlString).build();

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("db","/dataimport");
params.set("command", "full-import");
System.out.println(params.toString());
QueryResponse qresponse1 = solr1.query(params);

System.out.println("response = " + qresponse1);

The output I get from this is: response =
{responseHeader={status=0,QTime=0,params={wt=javabin,version=2,db=/dataimport,command=full-import}},response={numFound=0,start=0,docs=[]}}

There is a core named db which comes with the examples in the Solr 6.3
package. It is loaded. From the admin web UI I can operate it and run the
DIH reindex process.

I wonder whether this could work? What do you think? I am trying to call
DIH whilst Solr is running. This code is in a separate jar file that is run
beside the Solr instance.

This so far is not working for me, and I wonder why. What do you think?
Should this work at all? Or perhaps someone else could help out.


Thanks anyone for any help.


2016-11-25 19:50 GMT+01:00 Marek Ščevlík :

> I forgot to mention I am creating a jar file beside a running Solr 6.3
> instance, to which I am hoping to attach with Java via the
> SolrDispatchFilter to get at the cores, so that I could then work with data
> in code.
>
>
> 2016-11-25 19:31 GMT+01:00 Marek Ščevlík :
>
>> Hi Daniel. Thanks for the reply. I wonder whether it is still possible,
>> with the release of Solr 6.3, to get hold of the running instance of the
>> Jetty server that is part of the solution? I found some code for previous
>> versions where it was captured like this, and one could then obtain the
>> cores of a running Solr instance ...
>>
>> SolrDispatchFilter solrDispatchFilter =
>>     (SolrDispatchFilter) jetty.getDispatchFilter().getFilter();
>>
>>
>> I was trying to implement it this way but that is not working out very
>> well now. I can't seem to get the Jetty server object for the running
>> instance. I tried several combinations but none seemed to work.
>>
>> Can you perhaps point me in the right direction?
>>
>> Perhaps you may know more than I do at the moment.
>>
>>
>> Any help would be great.
>>
>>
>> Thanks a lot
>> Regards Marek Scevlik
>>
>>
>>
>> 2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] <
>> daniel.da...@nih.gov>:
>>
>>> Marek,
>>>
>>> I've wanted to do something like this in the past as well.  However, a
>>> rewrite that supports the same XML syntax might be better.   There are
>>> several problems with the design of the Data Import Handler that make it
>>> not quite suitable:
>>>
>>> - Not designed for Multi-threading
>>> - Bad implementation of XPath
>>>
>>> Another issue is that one of the big advantages of Data Import Handler
>>> goes away at this point, which is that it is hosted within Solr, and has a
>>> UI for testing within the Solr admin.
>>>
>>> A better open-source Java solution might be to connect Solr with Apache
>>> Camel - http://camel.apache.org/solr.html.
>>>
>>> If you are not tied absolutely to pure open-source, and freemium
>>> products will do, then you might look at Pentaho Spoon and Kettle.
>>>  Although Talend is much more established in the market, I find Pentaho's
>>> XML-based ETL a bit easier to integrate as a developer, and unit test and
>>> such.   Talend does better when you have a full infrastructure set up, but
>>> then the attention required to unit tests and Git integration seems over
>>> the top.
>>>
>>> Another powerful way to get things done, depending on what you are
>>> indexing, is to use LogStash and couple that with Document processing
>>> chains.   Many of our projects benefit from having a single RDBMS view,
>>> perhaps a materialized view, that is used for the index.   LogStash does
>>> just fine here, pulling from the RDBMS and posting each row to Solr.  The
>>> hierarchical execution of Data Import Handler is very nice, but this can
>>> often be handled on the RDBMS side by creating a view, maybe using
>>> functions to provide some rows.   Many RDBMS systems also support
>>> federation and the import of XML from files, so that this brings XML
>>> processing into the picture.
>>>
>>> Hoping this helps,
>>>
>>> Dan Davis, Systems/Applications Architect (Contractor),
>>> Office of Computer and Communications Systems,
>>> National Library of Medicine, NIH
>>>
>>>
>>>
>>>
>>> -Original Message-
>>> From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
>>> Sent: Friday, November 18, 2016 9:29 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Data Import Request Handler isolated into its own project - any
>>> suggestions?
>>>
>>> Hello. My name is Marek Scevlik.
>>>

Re: ClassicIndexSchemaFactory with Solr 6.3

2016-11-26 Thread Shawn Heisey
On 11/26/2016 10:58 AM, Furkan KAMACI wrote:
> I'm trying Solr 6.3. I don't want to use the Managed Schema. That was OK
> for Solr 5.x. However, the solrconfig.xml of Solr 6.3 doesn't have a
> ManagedIndexSchemaFactory definition. The documentation is wrong on this
> point (
> https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig
> ) How can I use ClassicIndexSchemaFactory with Solr 6.3?

I believe that the managed schema is the default now if you don't specify
the factory to use.  I checked basic_configs in 6.2.1 and that definition
did not appear to be present.  You'll probably have to *add* the schema
factory definition to the config.  It looks like it's a top-level element,
under <config>.  It's only one line.
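
For example (a sketch of that one line; placed directly under <config> in
solrconfig.xml):

  <schemaFactory class="ClassicIndexSchemaFactory"/>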

Thanks,
Shawn



ClassicIndexSchemaFactory with Solr 6.3

2016-11-26 Thread Furkan KAMACI
Hi,

I'm trying Solr 6.3. I don't want to use the Managed Schema. That was OK for
Solr 5.x. However, the solrconfig.xml of Solr 6.3 doesn't have a
ManagedIndexSchemaFactory definition. The documentation is wrong on this point (
https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig
)

How can I use ClassicIndexSchemaFactory with Solr 6.3?

Kind Regards,
Furkan KAMACI


org.apache.solr.common.SolrException: Bad contentType for search handler :text/xml; charset=UTF-8 request={q={!lucene}*:*=flat=true=json=all=false}

2016-11-26 Thread varun03sh
I am new to Solr and trying to integrate Solr into my PHP project through
Solarium.
Solarium library version: 3.2.0 
Solr version : 6.3.0

On trying to ping Solr I am getting 'Solarium\Exception\HttpException: Solr
HTTP error: OK (500)
{"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Ping
query caused exception: Bad contentType for search handler :text/xml;
charset=UTF-8
request={q={!lucene}*:*=flat=true=json=all=false}","trace":"org.apache.solr.common.SolrException:
Ping query caused exception: Bad contentType for search handler :text/xml;
charset=UTF-8
request={q={!lucene}*:*=flat=true=json=all=false}\r\n\tat
'

I started by creating a Solr core and didn't make any modifications to it.
Is there anything I missed in my implementation?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/org-apache-solr-common-SolrException-Bad-contentType-for-search-handler-text-xml-charset-UTF-8-reque-tp4307494.html
Sent from the Solr - User mailing list archive at Nabble.com.