Found a format that worked, kind of accidentally:

"es.query" : """{"query":{"match_all":{}},"fields":["title","_source"]}"""

Thanks,
Mohamed.
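For reference, here is a minimal sketch of the full read with that query body plugged in. It assumes the pyspark shell (so sc already exists) and reuses the index "en_2004/doc", the nodes "rap-es2.uis", and the input-format classes quoted below in this thread; adjust those for your own cluster.

match_all = """{"query":{"match_all":{}},"fields":["title","_source"]}"""

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "en_2004/doc",
        "es.nodes": "rap-es2.uis",
        # Full query DSL body, including the "query" part, not just URI parameters.
        "es.query": match_all,
    })

# Each element should come back as (document id, dict with the requested fields).
es_rdd.take(1)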
On Tue, Dec 9, 2014 at 11:27 AM, Mohamed Lrhazi <mohamed.lrh...@georgetown.edu> wrote:

> Thanks Nick... still no luck.
>
> If I use "?q=somerandomchars&fields=title,_source" I get an exception about
> an empty collection, which seems to indicate it is actually using the
> supplied es.query. But when I do rdd.take(1) or take(10), all I get is the
> id and an empty dict, apparently... maybe it has something to do with how my
> index is set up in ES?
>
> In [19]: es_rdd.take(4)
> 14/12/09 16:25:17 INFO SparkContext: Starting job: runJob at PythonRDD.scala:300
> 14/12/09 16:25:17 INFO DAGScheduler: Got job 18 (runJob at PythonRDD.scala:300) with 1 output partitions (allowLocal=true)
> 14/12/09 16:25:17 INFO DAGScheduler: Final stage: Stage 18(runJob at PythonRDD.scala:300)
> 14/12/09 16:25:17 INFO DAGScheduler: Parents of final stage: List()
> 14/12/09 16:25:17 INFO DAGScheduler: Missing parents: List()
> 14/12/09 16:25:17 INFO DAGScheduler: Submitting Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43), which has no missing parents
> 14/12/09 16:25:17 INFO MemoryStore: ensureFreeSpace(4776) called with curMem=1979220, maxMem=278302556
> 14/12/09 16:25:17 INFO MemoryStore: Block broadcast_32 stored as values in memory (estimated size 4.7 KB, free 263.5 MB)
> 14/12/09 16:25:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 18 (PythonRDD[30] at RDD at PythonRDD.scala:43)
> 14/12/09 16:25:17 INFO TaskSchedulerImpl: Adding task set 18.0 with 1 tasks
> 14/12/09 16:25:17 INFO TaskSetManager: Starting task 0.0 in stage 18.0 (TID 19, localhost, ANY, 24823 bytes)
> 14/12/09 16:25:17 INFO Executor: Running task 0.0 in stage 18.0 (TID 19)
> 14/12/09 16:25:17 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[VKgl4LAgRZyFaSopAWQL5Q/rap-es2-12|141.161.88.237:9200],shard=2]
> 14/12/09 16:25:17 WARN EsInputFormat: Cannot determine task id...
> 14/12/09 16:25:17 INFO PythonRDD: Times: total = 289, boot = 5, init = 284, finish = 0
> 14/12/09 16:25:17 ERROR NetworkClient: Node [Socket closed] failed (141.161.88.237:9200); selected next node [141.161.88.233:9200]
> 14/12/09 16:25:17 INFO Executor: Finished task 0.0 in stage 18.0 (TID 19). 1886 bytes result sent to driver
> 14/12/09 16:25:17 INFO TaskSetManager: Finished task 0.0 in stage 18.0 (TID 19) in 316 ms on localhost (1/1)
> 14/12/09 16:25:17 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool
> 14/12/09 16:25:17 INFO DAGScheduler: Stage 18 (runJob at PythonRDD.scala:300) finished in 0.324 s
> 14/12/09 16:25:17 INFO SparkContext: Job finished: runJob at PythonRDD.scala:300, took 0.337848207 s
> Out[19]:
> [(u'en_20040726_fbis_116728340038', {}),
>  (u'en_20040726_fbis_116728320448', {}),
>  (u'en_20040726_fbis_116728330192', {}),
>  (u'en_20040726_fbis_116728330145', {})]
>
> In [20]:
>
>
> On Tue, Dec 9, 2014 at 10:18 AM, Nick wrote:
>
>> Try "es.query" with something like "?q=*&fields=title,_source" for a
>> match-all query. You need the "q=*", which is the actual query part.
>>
>> On Tue, Dec 9, 2014 at 3:15 PM, Mohamed Lrhazi <mohamed.lrh...@georgetown.edu> wrote:
>>
>>> Hello,
>>>
>>> Following a couple of tutorials, I can't seem to get PySpark to return
>>> any "fields" from ES other than the document id.
>>>
>>> I tried like so:
>>>
>>> es_rdd = sc.newAPIHadoopRDD(
>>>     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
>>>     keyClass="org.apache.hadoop.io.NullWritable",
>>>     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
>>>     conf={
>>>         "es.resource": "en_2004/doc",
>>>         "es.nodes": "rap-es2.uis",
>>>         "es.query": "?fields=title,_source",
>>>     })
>>>
>>> es_rdd.take(1)
>>>
>>> Always shows:
>>>
>>> Out[13]: [(u'en_20040726_fbis_116728340038', {})]
>>>
>>> How does one get more fields?
>>>
>>> Thanks,
>>> Mohamed.
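For what it's worth, Nick's URI-style suggestion above maps onto the same conf dict roughly like this (only a sketch; as I read the elasticsearch-hadoop docs, an "es.query" value starting with "?" is treated as URI query parameters, while one starting with "{" is treated as a full query DSL body, which is the form that ended up working at the top of the thread):

conf = {
    "es.resource": "en_2004/doc",   # index/type from the thread, adjust as needed
    "es.nodes": "rap-es2.uis",
    # URI-style query: "q=*" supplies the actual query; "fields" picks what comes back.
    "es.query": "?q=*&fields=title,_source",
}

Either form can then be passed as the conf argument to sc.newAPIHadoopRDD, as in the sketch near the top of the thread.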